Where in the prompt should examples go?

After the task description, before the actual input the model is supposed to answer. Standard structure: [system prompt with task description] → [example 1: input → output] → [example 2: input → output] → ... → [actual input the model should answer]. Examples placed at the end of the few-shot block (closer to the actual input) tend to have slightly more influence than examples at the start.
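A minimal sketch of that ordering, assuming a plain-text prompt with "Input:" / "Output:" labels. The function name, the labels, and the sample sentences are illustrative choices, not a required format.

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble the prompt as: task description, then examples, then the actual input."""
    parts = [task.strip(), ""]
    for ex_input, ex_output in examples:
        parts.append(f"Input: {ex_input}")
        parts.append(f"Output: {ex_output}")
        parts.append("")
    # The actual input goes last, with the output left open for the model to fill in.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


# Hypothetical sentiment examples; the most important example is placed last.
examples = [
    ("The package arrived on time.", "positive"),
    ("The screen cracked after one day.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    examples,
    "Support never answered my email.",
)
print(prompt)
```

Because the examples are passed as an ordered list, moving an example closer to the actual input is a matter of reordering the list.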

Does few-shot work better than chain-of-thought?

They're complementary. Few-shot teaches format and pattern; chain-of-thought teaches reasoning structure. The two combine as few-shot chain-of-thought: provide 3-5 examples that include the reasoning steps as well as the final answer. On complex reasoning tasks, this combination typically outperforms either technique alone.
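As a rough sketch of few-shot chain-of-thought, each example below carries its reasoning as well as its answer, and the prompt ends at an open "Reasoning:" line. The field names, the arithmetic questions, and the "Answer:" marker are assumptions made for illustration.

```python
# Hypothetical worked examples: each one shows the reasoning, then the answer.
COT_EXAMPLES = [
    {
        "question": "A box holds 12 pens. How many pens are in 3 boxes?",
        "reasoning": "Each box holds 12 pens. 3 boxes hold 3 * 12 = 36 pens.",
        "answer": "36",
    },
    {
        "question": "Sam had 20 apples and gave away 8. How many are left?",
        "reasoning": "Sam starts with 20 apples. 20 - 8 = 12 apples remain.",
        "answer": "12",
    },
    {
        "question": "A train travels 60 km per hour for 2 hours. How far does it go?",
        "reasoning": "Distance is speed times time. 60 * 2 = 120 km.",
        "answer": "120",
    },
]


def build_cot_prompt(examples, question):
    """Render the examples with reasoning, then leave the reasoning open for the new question."""
    blocks = [
        f"Question: {ex['question']}\nReasoning: {ex['reasoning']}\nAnswer: {ex['answer']}"
        for ex in examples
    ]
    blocks.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(blocks)


def extract_answer(completion):
    """Return the text after the last 'Answer:' marker, or None if the marker is absent."""
    marker = "Answer:"
    if marker not in completion:
        return None
    return completion.rsplit(marker, 1)[1].strip()


cot_prompt = build_cot_prompt(COT_EXAMPLES, "A crate holds 8 jars. How many jars are in 5 crates?")
print(cot_prompt)
```

Keeping the final answer behind a fixed marker makes it easy to separate from the reasoning text when parsing the completion.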

Can few-shot examples leak into outputs?

Sometimes. The model can regurgitate phrases from few-shot examples in its output, especially with smaller models. Frontier models in 2026 are mostly past this. If it happens, vary the surface form of the examples and avoid examples whose specific wording you don't want copied.
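One way to spot this kind of leakage is to look for word sequences shared between the model output and the example outputs. The sketch below is a simple heuristic, assuming whitespace tokenization and a 4-word window; the function names and the threshold are illustrative.

```python
def word_ngrams(text, n):
    """Return the set of lowercase n-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaked_phrases(output, example_outputs, n=4):
    """Return n-word phrases that appear in both the output and any example output."""
    out_grams = word_ngrams(output, n)
    leaked = set()
    for example in example_outputs:
        leaked |= out_grams & word_ngrams(example, n)
    return sorted(" ".join(gram) for gram in leaked)


# Hypothetical example output and model output.
example_outputs = ["We are sorry for the inconvenience and will refund you."]
model_output = "Hello, we are sorry for the inconvenience caused."
print(leaked_phrases(model_output, example_outputs))
```

A non-empty result is only a hint: common phrases overlap by chance, so the window size needs tuning for the task.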

What is Few-Shot Learning in LLMs?

Few-shot learning is the technique of providing an LLM with a small number of input-output examples (typically 1-10) in the prompt to teach the model the desired pattern, format, or behavior. It is distinct from zero-shot prompting (no examples) and from fine-tuning (training on hundreds to thousands of examples).

Last updated 2026-04-28 | BearPlex AI Engineering Team

Overview

Few-shot learning is often the most cost-effective way to improve LLM accuracy on specialized tasks. By showing the model 3-5 examples of the input format and desired output, you can on many tasks match, and occasionally exceed, fine-tuned model accuracy without any training. The technique was popularized by GPT-3 (the original paper was titled 'Language Models are Few-Shot Learners') and remains a dominant pattern in production prompt engineering. In our BearPlex production work, the typical iteration loop is: zero-shot first → measure failures → add 3-5 few-shot examples covering the failure modes → re-measure. In our experience, this usually lifts accuracy from 70-80% to 90%+ without any model training.
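The "measure failures" step of that loop can be sketched as an evaluation that reports accuracy together with a failure count per input category. Everything here is a stand-in: `fake_zero_shot` is a hypothetical placeholder for a real model call, and the test items and category labels are invented for illustration.

```python
from collections import Counter


def evaluate(predict, test_set):
    """Return overall accuracy and the number of failures per category."""
    failures = Counter()
    correct = 0
    for item in test_set:
        if predict(item["input"]) == item["expected"]:
            correct += 1
        else:
            failures[item["category"]] += 1
    return correct / len(test_set), failures


def fake_zero_shot(text):
    """Hypothetical stand-in for a zero-shot model call: keys on a single word."""
    return "positive" if "great" in text else "negative"


# Invented test items, each tagged with the kind of input it represents.
TEST_SET = [
    {"input": "great product", "expected": "positive", "category": "plain"},
    {"input": "broke in a day", "expected": "negative", "category": "plain"},
    {"input": "oh great, it broke again", "expected": "negative", "category": "sarcasm"},
    {"input": "great, another delay", "expected": "negative", "category": "sarcasm"},
]

accuracy, failures = evaluate(fake_zero_shot, TEST_SET)
print(accuracy, dict(failures))
```

Grouping failures by category points directly at the input types the few-shot examples should cover before re-measuring.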

How few-shot learning works

Few-shot prompting exploits the LLM's in-context learning capability. You construct a prompt with the structure: task description → example 1 (input + output) → example 2 (input + output) → ... → final input requesting the model's output. The model uses the examples to infer the desired output format, style, classification taxonomy, or reasoning pattern. Mechanistically, Anthropic's interpretability research attributes much of this behavior to 'induction heads': attention heads that Transformers learn during training and that implement pattern-matching. The model sees that the examples follow input-output pattern X and then completes the final input following pattern X. The number of examples needed depends on task complexity: sometimes 1 example is enough, and sometimes 10+ are needed for nuanced tasks.
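The copy rule usually used to describe an induction head can be written as a toy function: find the most recent earlier occurrence of the current token and predict whatever followed it. This is only an illustration of the rule on a list of strings, not a model of a Transformer, and real in-context learning generalizes far beyond exact copying.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the last token. Returns None if there is none."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# "Mr" was followed by "Dursley" earlier, so the rule predicts "Dursley" again.
sequence = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
print(induction_predict(sequence))
```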

How to construct effective few-shot prompts

Effective few-shot prompts follow several principles:
(1) Diversity: examples should cover the range of inputs the model will see in production, including edge cases.
(2) Quality over quantity: 3 high-quality examples often beat 10 mediocre ones.
(3) Consistent format: every example must follow the exact same input/output structure, or the model gets confused.
(4) Cover the failure modes: if the model fails on certain input types in zero-shot, include those types in the few-shot examples.
(5) Order matters slightly: examples near the end of the prompt have somewhat more influence than examples at the start, so put your most important examples last.
(6) Test with held-out data: never evaluate few-shot performance on the examples you put in the prompt.
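The diversity and held-out principles can be combined in one small helper: take a fixed number of items from each input category for the prompt and keep everything else for evaluation. This is a sketch under assumptions: the `category` field, the helper name, and the support-ticket data are all hypothetical.

```python
def split_by_category(labeled, per_category=1):
    """Take the first `per_category` items of each category for the prompt;
    every remaining item is held out for evaluation."""
    taken = {}
    prompt_examples, held_out = [], []
    for item in labeled:
        count = taken.get(item["category"], 0)
        if count < per_category:
            prompt_examples.append(item)
            taken[item["category"]] = count + 1
        else:
            held_out.append(item)
    return prompt_examples, held_out


# Invented labeled data covering three input categories.
LABELED = [
    {"input": "Refund my order", "output": "billing", "category": "billing"},
    {"input": "App crashes on login", "output": "bug", "category": "bug"},
    {"input": "Charged twice this month", "output": "billing", "category": "billing"},
    {"input": "How do I export data?", "output": "how-to", "category": "how-to"},
    {"input": "Button does nothing", "output": "bug", "category": "bug"},
    {"input": "Where is the settings page?", "output": "how-to", "category": "how-to"},
]

prompt_examples, held_out = split_by_category(LABELED)
print(len(prompt_examples), len(held_out))
```

Because an item lands in exactly one of the two lists, accuracy measured on `held_out` is never inflated by examples the model has already seen in the prompt.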

Few-shot vs fine-tuning trade-offs

Few-shot wins on speed (no training infrastructure, ship in hours), iteration cost (change examples, re-test, no retraining), and explainability (the prompt is the spec). Fine-tuning wins on per-call cost at scale (no example tokens consumed every call), specialized format compliance (fine-tuned models stick to format more reliably), and edge cases that prompting struggles to capture. The break-even point in our experience: fine-tuning becomes worth it when (a) you'd need 20+ examples to cover all failure modes (the prompt becomes unwieldy), (b) you serve millions of requests where the per-call example tokens cost real money, or (c) you need extremely consistent output format that prompting can't reliably achieve.
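The per-call token argument can be made concrete with a back-of-the-envelope calculation. All figures below are hypothetical, and the sketch assumes a one-time fine-tuning cost and the same per-token price for the base and fine-tuned models, which may not hold for a given provider.

```python
def monthly_example_cost(example_tokens, requests_per_month, price_per_million_tokens):
    """Cost per month of re-sending the few-shot examples with every request."""
    return example_tokens * requests_per_month * price_per_million_tokens / 1_000_000


def months_to_break_even(fine_tune_cost, example_tokens, requests_per_month,
                         price_per_million_tokens):
    """Months until a one-time fine-tuning cost is repaid by dropping the examples."""
    saved_per_month = monthly_example_cost(
        example_tokens, requests_per_month, price_per_million_tokens
    )
    return fine_tune_cost / saved_per_month


# Hypothetical numbers: 5 examples of about 200 tokens each, 2 million requests
# per month, 1.00 per million input tokens, and a 6,000 one-time fine-tuning cost.
cost = monthly_example_cost(1_000, 2_000_000, 1.00)
months = months_to_break_even(6_000, 1_000, 2_000_000, 1.00)
print(cost, months)
```

With these inputs the examples cost 2,000 per month, so the fine-tuning cost is repaid in 3 months; at a hundredth of the traffic the same calculation gives 300 months, which is why request volume dominates the decision.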

Use cases

Improving classification accuracy on borderline zero-shot tasks
Enforcing specific output formats (JSON schema, markdown structure, custom DSL)
Teaching the model domain-specific terminology or style
Demonstrating multi-step reasoning patterns the model should follow
Reducing hallucination on tasks where examples ground the desired output

Examples in production

OpenAI (GPT-3 paper, 2020)

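Brown et al. compared zero-shot, one-shot, and few-shot prompting (typically 10 to 100 examples, as many as fit in the context window) across dozens of NLP benchmarks, and showed that GPT-3's accuracy improved substantially as examples were added to the prompt.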

Anthropic interpretability

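Identified 'induction heads' as a Transformer attention pattern that appears to underlie much of in-context learning, the mechanism that makes few-shot learning work. The evidence is strongest for small attention-only models and more indirect for large models.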

BearPlex production engagements

Standard prompt engineering loop: zero-shot baseline → identify failure modes → add 3-5 few-shot examples covering failures → typically lifts accuracy from 70-80% to 90%+ without fine-tuning.

Few-Shot Learning compared to alternatives

Alternative	Choose Few-Shot Learning when	Choose the alternative when
Zero-shot learning (task description only, no examples)	Zero-shot accuracy is borderline or the output format is unusual	The task is clearly describable and the base model handles it well
Fine-tuning (training the model on hundreds to thousands of examples)	You need fast iteration and 3-10 examples cover the space	You need 20+ examples, serve high-volume traffic, or need rigid format compliance

Common pitfalls

Inconsistent example formatting: the model gets confused when examples differ in structure
Too few examples for complex tasks: sometimes 3 isn't enough for nuanced classification
Too many examples: past 10-15 examples, returns diminish and you're paying real token cost per call
Cherry-picked examples that don't represent the production distribution
Evaluating accuracy on the examples you used in the prompt: always use held-out test data
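Several of these pitfalls can be caught mechanically before a prompt ships. The sketch below assumes each example is a dict with `input` and `output` keys and uses the 15-example ceiling mentioned above as a default; the function names and message wording are illustrative.

```python
def check_examples(examples, required_keys=("input", "output"), max_examples=15):
    """Return a list of problems; an empty list means the examples look consistent."""
    problems = []
    for i, ex in enumerate(examples):
        missing = sorted(set(required_keys) - set(ex))
        extra = sorted(set(ex) - set(required_keys))
        if missing:
            problems.append(f"example {i}: missing keys {missing}")
        if extra:
            problems.append(f"example {i}: unexpected keys {extra}")
    if len(examples) > max_examples:
        problems.append(f"{len(examples)} examples exceeds the limit of {max_examples}")
    return problems


def overlap_with_test_set(examples, test_set):
    """Return prompt-example inputs that also appear in the test set."""
    test_inputs = {item["input"] for item in test_set}
    return [ex["input"] for ex in examples if ex.get("input") in test_inputs]


# Hypothetical examples: the second one uses a different key for its input.
candidate_examples = [
    {"input": "Refund my order", "output": "billing"},
    {"text": "App crashes on login", "output": "bug"},
]
print(check_examples(candidate_examples))
```

Checks like these cover formatting, example count, and test-set contamination; whether the examples represent the production distribution still needs a human look at real traffic.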

FAQ

Questions about Few-Shot Learning.

Usually 3-7 examples are enough. Below 3, you often don't have enough variety to teach the pattern. Above 10-15, returns diminish and per-call token cost becomes meaningful. Start with 3-5, and add more only when you identify specific failure modes that examples would address.
