The terms that actually run in production.
Authoritative definitions of the AI engineering terms used in real enterprise systems. Maintained by the BearPlex engineering team. Updated as the field evolves.
An agent is an AI system that perceives its environment, makes decisions, and takes actions to achieve goals, typically by using tools, executing multi-step ...
An AI agent is a language model-powered system that autonomously perceives its context, plans multi-step actions, calls tools or APIs, and iterates toward a ...
AI alignment is the discipline of ensuring AI systems behave in ways consistent with human intentions and values, spanning training-time techniques (RLHF, Co...
AI observability is the practice of instrumenting and monitoring production AI systems to understand their behavior, detect issues, and continuously improve ...
AI safety is the multidisciplinary field focused on building AI systems that don't cause harm, spanning technical alignment research (making models do what w...
The attention mechanism is a neural network component that lets each position in a sequence dynamically weight how much to attend to every other position (co...
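A minimal sketch of the weighting this entry describes, assuming the standard scaled dot-product form (a softmax over query-key dot products divided by the square root of the key dimension). The names `softmax` and `attention` are illustrative and written in plain Python for readability; they are not taken from any library.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors.

    Each query scores every key, the scores become weights via softmax,
    and the output is the weighted average of the value vectors.
    """
    d = len(keys[0])  # key dimension, used for scaling
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

When every key is identical, each position gets the same weight and the output is simply the mean of the value vectors, which is a handy sanity check.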
Chain-of-Thought (CoT) is a prompting technique that encourages an LLM to articulate its reasoning step by step before producing a final answer, significantl...
Chunking is the preprocessing step in RAG pipelines where source documents are split into smaller passages (typically 200-800 tokens each) before embedding a...
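A minimal sketch of the splitting step, assuming the document has already been tokenized into a list. The overlap between neighboring chunks is an assumption added here for illustration (a common practice, but not stated in the entry), and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens.

    chunk_size=400 falls inside the 200-800 range mentioned in the entry;
    the overlap default is an illustrative assumption.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

For quick experiments, `text.split()` can stand in for a real tokenizer, although word counts only approximate model token counts.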
Constitutional AI (CAI) is Anthropic's alignment approach where a language model is trained to critique and revise its own responses according to a written s...
A context window is the maximum number of tokens an LLM can process in a single inference call (including both the input prompt (system instructions, convers...
Dataset curation is the discipline of constructing, cleaning, labeling, and maintaining the datasets used to train, fine-tune, evaluate, and align AI systems...
Knowledge distillation is a model compression technique where a smaller 'student' model is trained to mimic the outputs of a larger 'teacher' model, producin...
Direct Preference Optimization (DPO) is a fine-tuning method that aligns language models to human preferences directly from a dataset of preferred vs rejecte...
An embedding is a numerical vector representation of text, image, or other data (typically a dense array of 384 to 4,096 floating-point numbers) that encodes...
An embedding model is a neural network trained to convert text (or images, audio, or other content) into fixed-size dense numerical vectors that capture sema...
An evaluation harness is the automated test infrastructure that measures LLM system quality across a representative set of inputs (combining held-out test da...
Few-shot learning is the technique of providing an LLM with a small number of input-output examples (typically 1-10) in the prompt to teach the model the des...
Fine-tuning is the process of continuing a pretrained language model's training on a smaller, task-specific dataset to adapt its behavior, style, or capabili...
FlashAttention is an exact attention algorithm by Tri Dao that computes the same mathematical result as standard attention but with much better GPU memory ba...
Function calling is the LLM capability to generate structured JSON output that invokes external functions or APIs, enabling the model to read databases, call...
A knowledge graph is a structured representation of entities (people, products, places, concepts) and the relationships between them (typically stored in gra...
The KV cache (key-value cache) is a memory structure used during LLM inference that stores the computed key and value matrices from previous tokens, enabling...
MCP (Model Context Protocol) is an open standard introduced by Anthropic in late 2024 that defines how AI assistants connect to external data sources and too...
Mixture of Experts (MoE) is a neural network architecture where each forward pass routes tokens through a small subset of specialized 'expert' subnetworks ra...
A multi-agent system (MAS) is an AI architecture where multiple specialized agents (each with its own role, tools, and prompt) coordinate to accomplish tasks...
Parameter-Efficient Fine-Tuning (PEFT) is a family of fine-tuning techniques (including LoRA, QLoRA, prefix tuning, and prompt tuning) that update only a sma...
Prompt engineering is the discipline of designing the inputs (instructions, context, examples, formatting cues) given to a language model to elicit reliable,...
Prompt injection is an attack technique where adversarial input causes an LLM to ignore its original instructions and follow new instructions embedded in use...
RAG (Retrieval-Augmented Generation) is an AI architecture that retrieves relevant documents from a knowledge base and injects them into a large language mod...
ReAct is an LLM agent design pattern where the model alternates between Reasoning steps (thinking through what to do next) and Acting steps (calling tools or...
Reranking is the second-stage scoring of candidate documents in a retrieval pipeline, using a more accurate but slower model (typically a cross-encoder) to r...
RLHF (Reinforcement Learning from Human Feedback) is a post-training technique where human annotators rank model outputs by quality, those rankings train a r...
Scaling laws are empirical mathematical relationships discovered in deep learning research showing that model performance improves predictably as a power-law...
Self-consistency is an LLM reasoning technique where multiple chain-of-thought reasoning paths are sampled at temperature > 0 for the same problem, then the ...
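A minimal sketch of the sampling-and-aggregation loop, assuming the usual formulation in which the final answers of the sampled paths are combined by majority vote. `sample_answer` is a hypothetical callable standing in for one model call at temperature > 0 that returns only the final answer; no real API is implied.

```python
from collections import Counter


def self_consistent_answer(sample_answer, n_samples=5):
    """Sample several reasoning paths and return the most common answer.

    sample_answer: hypothetical zero-argument callable that runs one
    chain-of-thought sample and returns its final answer as a string.
    Returns (winning_answer, fraction_of_samples_that_agreed).
    """
    answers = [sample_answer() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples
```

The agreement fraction can serve as a rough confidence signal: a low value suggests the sampled paths disagree and the answer may deserve a second look.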
Semantic search is a retrieval technique that finds documents based on meaning rather than exact keyword match (converting both queries and documents to high...
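A minimal sketch of meaning-based ranking, assuming queries and documents have already been converted to vectors by some embedding model and that cosine similarity is the comparison metric (a common choice, assumed here). `cosine_similarity` and `search` are illustrative names, and the vectors a caller passes in would come from a real model rather than being written by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search(query_vec, doc_vecs, top_k=2):
    """Return the ids of the top_k documents most similar to the query.

    doc_vecs: dict mapping document id -> embedding vector.
    """
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]
```

A linear scan like this is fine for a few thousand documents; larger collections typically rely on an approximate nearest-neighbor index instead.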
Speculative decoding is an LLM inference optimization where a small fast 'draft model' proposes multiple tokens, the large target model verifies them in a si...
Structured output is the LLM capability to generate output that conforms to a specific schema (typically JSON matching a defined structure with typed fields)...
A system prompt is a special message sent to an LLM at the start of a conversation that defines the model's persona, capabilities, constraints, and instructi...
Temperature is a numerical parameter (typically 0.0 to 2.0) that controls randomness in LLM output sampling: lower values make the model more deterministic a...
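A minimal sketch of how temperature reshapes a sampling distribution, assuming the common implementation in which logits are divided by the temperature before the softmax. Treating a temperature of exactly 0 as greedy (argmax) selection is a convention assumed here; `apply_temperature` is an illustrative name.

```python
import math


def apply_temperature(logits, temperature):
    """Convert logits to probabilities after temperature scaling.

    Lower temperatures sharpen the distribution toward the top logit;
    higher temperatures flatten it. Temperature 0 is handled as argmax
    (an assumed convention, since dividing by zero is undefined).
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Comparing the top token's probability at 0.5, 1.0, and 2.0 for the same logits makes the effect concrete: the probability shrinks as the temperature rises.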
A token is the fundamental unit of text that LLMs process (typically a word, subword, or punctuation mark) produced by a tokenizer that splits raw text into ...
Tool use (also called function calling) is the capability of an LLM to invoke external functions, APIs, or other systems during inference, allowing the model...
The Transformer is a neural network architecture introduced in the 2017 paper 'Attention Is All You Need' that uses self-attention mechanisms instead of recu...
Tree of Thoughts (ToT) is an LLM reasoning technique that explores multiple reasoning paths in parallel (branching at each reasoning step to consider alterna...