AI engineering glossary

The terms that actually run in production.

Authoritative definitions of the AI engineering terms used in real enterprise systems. Maintained by the BearPlex engineering team. Updated as the field evolves.

50 terms

Agent

An agent is an AI system that perceives its environment, makes decisions, and takes actions to achieve goals, typically by using tools, executing multi-step ...

AI Agent

An AI agent is a language model-powered system that autonomously perceives its context, plans multi-step actions, calls tools or APIs, and iterates toward a ...

AI Alignment

AI alignment is the discipline of ensuring AI systems behave in ways consistent with human intentions and values, spanning training-time techniques (RLHF, Co...

AI Observability

AI observability is the practice of instrumenting and monitoring production AI systems to understand their behavior, detect issues, and continuously improve ...

AI Safety

AI safety is the multidisciplinary field focused on building AI systems that don't cause harm, spanning technical alignment research (making models do what w...

Attention Mechanism

The attention mechanism is a neural network component that lets each position in a sequence dynamically weight how much to attend to every other position (co...
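
As a rough illustration of the weighting described above, here is a minimal sketch of scaled dot-product attention in plain Python. The function names and the list-of-lists vector format are assumptions made for this example; production systems use batched tensor libraries instead.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Each query scores every key, the scores become weights via softmax,
    and the output is the weighted average of the value vectors.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

When every key matches the query equally, the weights are uniform and the output is simply the mean of the values.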

Chain-of-Thought

Chain-of-Thought (CoT) is a prompting technique that encourages an LLM to articulate its reasoning step by step before producing a final answer, significantl...

Chunking

Chunking is the preprocessing step in RAG pipelines where source documents are split into smaller passages (typically 200-800 tokens each) before embedding a...
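
A minimal sketch of the splitting step, under two stated assumptions: whitespace-separated words stand in for tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the `overlap` parameter is a common practice added here for illustration rather than something the definition specifies.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into passages of at most chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Ten "words" split into chunks of 4 with an overlap of 1.
sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# → ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```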

Constitutional AI

Constitutional AI (CAI) is Anthropic's alignment approach where a language model is trained to critique and revise its own responses according to a written s...

Context Window

A context window is the maximum number of tokens an LLM can process in a single inference call, including both the input prompt (system instructions, convers...

Dataset Curation

Dataset curation is the discipline of constructing, cleaning, labeling, and maintaining the datasets used to train, fine-tune, evaluate, and align AI systems...

Distillation

Knowledge distillation is a model compression technique where a smaller 'student' model is trained to mimic the outputs of a larger 'teacher' model, producin...

DPO

Direct Preference Optimization (DPO) is a fine-tuning method that aligns language models to human preferences directly from a dataset of preferred vs rejecte...

Embedding

An embedding is a numerical vector representation of text, image, or other data (typically a dense array of 384 to 4,096 floating-point numbers) that encodes...

Embedding Model

An embedding model is a neural network trained to convert text (or images, audio, or other content) into fixed-size dense numerical vectors that capture sema...

Evaluation Harness

An evaluation harness is the automated test infrastructure that measures LLM system quality across a representative set of inputs (combining held-out test da...

Few-Shot Learning

Few-shot learning is the technique of providing an LLM with a small number of input-output examples (typically 1-10) in the prompt to teach the model the des...

Fine-tuning

Fine-tuning is the process of continuing a pretrained language model's training on a smaller, task-specific dataset to adapt its behavior, style, or capabili...

FlashAttention

FlashAttention is an exact attention algorithm by Tri Dao that computes the same mathematical result as standard attention but with much better GPU memory ba...

Function Calling

Function calling is the LLM capability to generate structured JSON output that invokes external functions or APIs, enabling the model to read databases, call...
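
The application side of this can be sketched as follows. The JSON shape (`name` plus `arguments`), the `get_weather` tool, and the `TOOLS` registry are all hypothetical and chosen for illustration; each model provider defines its own request and response format.

```python
import json


def get_weather(city):
    """Hypothetical tool that returns canned data instead of calling an API."""
    return {"city": city, "forecast": "sunny"}


# Registry mapping tool names the model may emit to Python callables.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output):
    """Parse the model's JSON tool call and invoke the matching function."""
    call = json.loads(model_output)
    func = TOOLS.get(call["name"])
    if func is None:
        raise KeyError(f"unknown tool: {call['name']}")
    return func(**call["arguments"])


result = dispatch('{"name": "get_weather", "arguments": {"city": "Lahore"}}')
print(result)
```

Looking tools up in an explicit registry, rather than evaluating whatever name the model emits, keeps the set of callable functions under the application's control.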

Guardrails

LLM guardrails are programmatic safety and quality checks applied to model inputs and outputs (including content filters, structured output validation, hallu...

Hallucination

Hallucination is the failure mode where a language model generates content that is fluent, confident-sounding, and incorrect: fabricating facts, citations, c...

Inference

Inference is the process of running a trained machine learning model on new input to generate output. For LLMs specifically, this means taking a prompt, proc...

Knowledge Graph

A knowledge graph is a structured representation of entities (people, products, places, concepts) and the relationships between them (typically stored in gra...

KV Cache

The KV cache (key-value cache) is a memory structure used during LLM inference that stores the computed key and value matrices from previous tokens, enabling...

LoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the pretrained model weights and trains small low-rank matrices inject...
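
A small numeric sketch of the idea, assuming the usual LoRA formulation in which the effective weight is the frozen matrix plus a scaled product of two low-rank matrices. The helper names and the plain-list matrix format are illustrative only, not any library's API.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A).

    w: frozen pretrained weight, shape d_out x d_in (never updated)
    b: trainable matrix, shape d_out x rank
    a: trainable matrix, shape rank x d_in
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d_out, d_in, rank):
    """Parameters trained by LoRA, versus d_out * d_in for full fine-tuning."""
    return d_out * rank + rank * d_in
```

For a 4096 x 4096 layer at rank 8, `trainable_params` gives 65,536 trainable values against roughly 16.8 million in the full matrix.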

MCP

MCP (Model Context Protocol) is an open standard introduced by Anthropic in late 2024 that defines how AI assistants connect to external data sources and too...

Mixture of Experts

Mixture of Experts (MoE) is a neural network architecture where each forward pass routes tokens through a small subset of specialized 'expert' subnetworks ra...

Multi-Agent System

A multi-agent system (MAS) is an AI architecture where multiple specialized agents (each with its own role, tools, and prompt) coordinate to accomplish tasks...

PEFT

Parameter-Efficient Fine-Tuning (PEFT) is a family of fine-tuning techniques (including LoRA, QLoRA, prefix tuning, and prompt tuning) that update only a sma...

Prompt Engineering

Prompt engineering is the discipline of designing the inputs (instructions, context, examples, formatting cues) given to a language model to elicit reliable,...

Prompt Injection

Prompt injection is an attack technique where adversarial input causes an LLM to ignore its original instructions and follow new instructions embedded in use...

Quantization

Quantization is the process of reducing the numerical precision of a neural network's weights and activations, typically from 16-bit floating point (FP16/BF1...
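
One simple scheme, shown here only as a sketch: symmetric per-tensor quantization to 8-bit integers. The choice of int8 and of a single scale per tensor is an assumption for this example; real toolchains offer several schemes (per-channel scales, 4-bit formats, and others).

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value differs from the original by at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, max_error <= scale / 2 + 1e-12)
```

The rounding error per weight is bounded by half the scale, which is the precision traded away for the smaller storage format.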

RAG

RAG (Retrieval-Augmented Generation) is an AI architecture that retrieves relevant documents from a knowledge base and injects them into a large language mod...

ReAct Pattern

ReAct is an LLM agent design pattern where the model alternates between Reasoning steps (thinking through what to do next) and Acting steps (calling tools or...

Reranking

Reranking is the second-stage scoring of candidate documents in a retrieval pipeline, using a more accurate but slower model (typically a cross-encoder) to r...

RLHF

RLHF (Reinforcement Learning from Human Feedback) is a post-training technique where human annotators rank model outputs by quality, those rankings train a r...

Scaling Laws

Scaling laws are empirical mathematical relationships discovered in deep learning research showing that model performance improves predictably as a power-law...

Self-Consistency

Self-consistency is an LLM reasoning technique where multiple chain-of-thought reasoning paths are sampled at temperature > 0 for the same problem, then the ...
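
A sketch of the aggregation step, assuming the usual form of the technique in which the most frequent final answer across the sampled paths is selected. The sampling itself is omitted; `final_answers` is a hypothetical list of answers already extracted from each reasoning path.

```python
from collections import Counter


def majority_answer(final_answers):
    """Return the most common answer and the share of samples that agree."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)


# Five sampled reasoning paths, three of which reach the same answer.
print(majority_answer(["42", "41", "42", "42", "40"]))
# → ('42', 0.6)
```

The agreement ratio can double as a rough confidence signal: a low share suggests the samples disagree and the answer deserves review.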

Semantic Search

Semantic search is a retrieval technique that finds documents based on meaning rather than exact keyword match (converting both queries and documents to high...
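
To make the retrieval step concrete, here is a minimal sketch that ranks documents by cosine similarity to a query vector. The tiny two-dimensional vectors are hand-written stand-ins for the output of an embedding model, and the brute-force scan is an assumption for clarity; at scale, a vector index typically replaces it.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, doc_vecs, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(search([1.0, 0.1], docs))
# → ['a', 'c']
```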

Speculative Decoding

Speculative decoding is an LLM inference optimization where a small fast 'draft model' proposes multiple tokens, the large target model verifies them in a si...

Structured Output

Structured output is the LLM capability to generate output that conforms to a specific schema (typically JSON matching a defined structure with typed fields)...

System Prompt

A system prompt is a special message sent to an LLM at the start of a conversation that defines the model's persona, capabilities, constraints, and instructi...

Temperature

Temperature is a numerical parameter (typically 0.0 to 2.0) that controls randomness in LLM output sampling: lower values make the model more deterministic a...
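
The effect can be sketched with a temperature-scaled softmax: logits are divided by the temperature before being converted to probabilities. The function name and the example logits are illustrative assumptions. Temperature 0 is handled in practice as greedy argmax selection, which this sketch leaves out.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # mass concentrates on the top token
flat = softmax_with_temperature(logits, 2.0)   # mass spreads across tokens
print(sharp[0] > flat[0])
# → True
```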

Token

A token is the fundamental unit of text that LLMs process (typically a word, subword, or punctuation mark) produced by a tokenizer that splits raw text into ...

Tool Use

Tool use (also called function calling) is the capability of an LLM to invoke external functions, APIs, or other systems during inference, allowing the model...

Transformer

The Transformer is a neural network architecture introduced in the 2017 paper 'Attention Is All You Need' that uses self-attention mechanisms instead of recu...

Tree of Thoughts

Tree of Thoughts (ToT) is an LLM reasoning technique that explores multiple reasoning paths in parallel (branching at each reasoning step to consider alterna...

Vector Database

A vector database is a specialized database optimized for storing and querying high-dimensional embedding vectors, supporting fast nearest-neighbor search ac...

Zero-Shot Learning

Zero-shot learning is the ability of an LLM to perform a task without being given any examples (relying entirely on the task description in the prompt and th...