What is Chain-of-Thought (CoT)?
Chain-of-Thought (CoT) is a prompting technique that encourages an LLM to articulate its reasoning step by step before producing a final answer. Making the model 'think out loud' rather than jump directly to a conclusion significantly improves performance on multi-step problems. Modern frontier models (Claude Sonnet 4.5, GPT-5, Gemini 2.5) often do this automatically; explicit CoT prompting is most impactful with smaller or older models.
Overview
Chain-of-Thought was introduced in a 2022 Google paper (Wei et al.) that demonstrated dramatic improvements on math and reasoning benchmarks by including worked, step-by-step reasoning in the few-shot examples given to the model. A follow-up paper the same year (Kojima et al.) showed that simply appending 'Let's think step by step' to the prompt also helps, with no examples at all. The technique exposed something fundamental: LLMs produce more accurate answers when they generate intermediate reasoning steps first, rather than predicting the final answer in one shot. The intuition is that each generated token sets the context for the next, so a reasoning chain acts as a kind of working scratchpad. From late 2024 onward, CoT evolved into reasoning models like OpenAI's o1/o3 and Anthropic's extended thinking: models specifically trained to produce extensive internal reasoning before answering. Modern best practice: rely on models' built-in reasoning for simple tasks, prompt explicitly for CoT when the model isn't reasoning enough, and use reasoning models for the hardest problems.
How CoT works
The mechanism is simple: instead of asking 'What is 17 × 24?' you ask 'What is 17 × 24? Think step by step.' The model produces something like '17 × 24 = 17 × 20 + 17 × 4 = 340 + 68 = 408.' The intermediate steps both verify the work and establish context that makes the final number more accurate. Variants include: zero-shot CoT (just add 'think step by step'), few-shot CoT (provide examples of the reasoning pattern), self-consistency (sample multiple CoT chains and majority-vote the answer), and tree-of-thought (explore multiple reasoning paths in a tree structure).
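The variants above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the `Answer:` output convention, and the `sample` callable (a stand-in for whatever LLM client you use, called with sampling enabled so the chains differ) are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Assumed convention: ask the model to end with a line starting "Answer:".
COT_SUFFIX = "Think step by step, then give the result on a last line starting with 'Answer:'."


def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append a reasoning instruction to the question."""
    return f"{question}\n{COT_SUFFIX}"


def few_shot_cot(question: str, examples: List[Tuple[str, str]]) -> str:
    """Few-shot CoT: prepend (question, worked reasoning) pairs as a pattern to imitate."""
    shots = "\n\n".join(f"Q: {q}\nA: {worked}" for q, worked in examples)
    return f"{shots}\n\nQ: {question}\nA:"


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if there is none."""
    for line in reversed(completion.strip().splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return None


def self_consistency(question: str, sample: Callable[[str], str], n: int = 5) -> Optional[str]:
    """Sample n CoT chains and majority-vote their final answers.

    `sample` is a hypothetical callable that sends a prompt to a model and
    returns one completion.
    """
    prompt = zero_shot_cot(question)
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Tree-of-thought is left out because it needs a search loop and a scoring step, which would not fit a short sketch. Note that self-consistency multiplies the number of model calls by `n`.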
Reasoning models: CoT built into the model
Reasoning models (OpenAI o1/o3 series, Claude with extended thinking, Gemini 2.5 with thinking) build CoT into the model itself through training. Instead of relying on prompt engineering, these models produce extensive internal reasoning by default, often generating thousands of tokens of internal thinking before outputting the final answer. The internal thinking isn't shown to users by default, but it dramatically improves accuracy on complex problems (math olympiad questions, scientific reasoning, complex code). The trade-off is much higher latency (10-60 seconds is typical) and cost (roughly 10-100× that of non-reasoning models). Use reasoning models for hard problems and non-reasoning models for routine tasks.
When CoT helps and when it doesn't
CoT helps significantly on multi-step math, complex reasoning, and tasks requiring careful analysis. It helps only modestly on tasks where the model already knows the answer, because the generated reasoning is largely redundant. CoT can hurt on simple tasks (overthinking introduces errors), on tasks requiring creativity (a step-by-step structure dampens novelty), and on high-volume, low-complexity tasks (the cost overhead doesn't pay back). The rule of thumb: use CoT for problems where you, a human, would want to think step by step.
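The guidance above can be written down as a small routing function. This is a sketch under stated assumptions: the `Task` fields and the `Strategy` names are hypothetical labels invented for the example, and in practice you would have to decide how to set those flags for your own workload.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    DIRECT = "direct"                    # plain prompt, no reasoning requested
    COT_PROMPT = "cot_prompt"            # standard model plus an explicit CoT prompt
    REASONING_MODEL = "reasoning_model"  # model trained to reason internally


@dataclass
class Task:
    # Hypothetical task descriptors; all default to False.
    multi_step: bool = False         # needs several dependent reasoning steps
    creative: bool = False           # open-ended, novelty matters
    accuracy_critical: bool = False  # a wrong answer is expensive
    latency_tolerant: bool = False   # can wait tens of seconds and pay more


def choose_strategy(task: Task) -> Strategy:
    """Apply the rule of thumb: reason only where a human would want to."""
    if task.creative or not task.multi_step:
        return Strategy.DIRECT
    if task.accuracy_critical and task.latency_tolerant:
        return Strategy.REASONING_MODEL
    return Strategy.COT_PROMPT
```

For example, `choose_strategy(Task(multi_step=True))` selects explicit CoT prompting, while a simple classification task with every flag left at `False` falls through to direct prompting.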
Use cases
- Math problem solving (arithmetic, word problems, multi-step computation)
- Logical reasoning and inference (formal logic, syllogisms, puzzles)
- Code debugging where the model needs to trace through logic
- Multi-document analysis where evidence from different sources combines
- Strategic decision frameworks where pros/cons analysis improves quality
- Diagnostic reasoning (medical, technical troubleshooting, root cause analysis)
Examples in production
Google Research (CoT paper)
The original Chain-of-Thought paper (Wei et al., 2022) demonstrated dramatic accuracy improvements on math and reasoning tasks by adding worked reasoning steps to the few-shot examples in the prompt.
OpenAI o1/o3 reasoning models
OpenAI's o1 and o3 models are trained to produce extensive internal chain-of-thought reasoning before answering, with significant accuracy gains on math, code, and scientific reasoning.
Anthropic extended thinking
Anthropic's Claude with extended thinking lets you configure how much internal reasoning the model does, trading latency and cost for accuracy on complex problems.
DeepSeek R1
DeepSeek's R1 (open weights) brought reasoning-model performance with detailed CoT to the open-weights ecosystem, demonstrating that the technique works beyond closed labs.
Chain-of-Thought compared to alternatives
| Alternative | Choose Chain-of-Thought when | Choose alternative when |
|---|---|---|
| Direct prompting (no CoT): asking the model for an answer without requesting reasoning | Multi-step problems, complex reasoning, or tasks where verification matters. | Simple tasks where the model knows the answer immediately; direct prompting is faster and cheaper. |
| Tool use (calling a calculator or code executor): letting the model invoke external tools for reliable computation | The reasoning itself is the value (analysis, judgment, complex inference). | Computation is the issue (math, code execution, data lookup); tools are dramatically more reliable than CoT for these. |
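To make the tool-use row concrete, here is a minimal sketch of a calculator tool that a model could call instead of working out arithmetic in its reasoning chain. The function name `calculator_tool` is an assumption for illustration, and wiring it into a particular provider's tool-calling API is deliberately left out. It parses the expression with Python's standard `ast` module and evaluates only basic arithmetic, so it never calls `eval()` on model output.

```python
import ast
import operator
from typing import Union

Number = Union[int, float]

# Only these operators are allowed; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator_tool(expression: str) -> Number:
    """Evaluate +, -, *, / arithmetic with parentheses and unary minus."""

    def evaluate(node: ast.AST) -> Number:
        if isinstance(node, ast.Constant):
            # Accept plain numbers only (bool is a subclass of int, so exclude it).
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](evaluate(node.left), evaluate(node.right))
        elif isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return evaluate(ast.parse(expression, mode="eval").body)
```

With this in place, `calculator_tool("17 * 24")` returns the exact product regardless of how the model would have reasoned about it, while anything that is not plain arithmetic raises `ValueError`.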
Common pitfalls
- Using CoT on simple tasks: overthinking can introduce errors. Save CoT for problems where step-by-step reasoning genuinely helps.
- Ignoring the cost of reasoning models: o1/o3 and extended thinking can be 10-100× more expensive per call. Budget accordingly.
- Not validating intermediate steps: CoT chains can contain confident-sounding but wrong intermediate reasoning. The final answer's confidence reflects coherence of the chain, not correctness.
- Overusing self-consistency: sampling 10 CoT chains and majority-voting improves accuracy but multiplies cost. Use selectively.
- Showing internal reasoning to end users: extended-thinking output is optimized for the model, not for humans. Hide it or summarize before display.
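The pitfall about unvalidated intermediate steps can be partly addressed in code when the reasoning is arithmetic. The sketch below assumes the model was asked to write one step per line in the form `a op b = c` (an assumption about the output format, not something every model does); lines in any other shape are skipped rather than checked. The name `find_bad_steps` is hypothetical.

```python
import operator
import re
from typing import List, Tuple

_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "×": operator.mul}

# Matches a whole line such as "17 × 20 = 340" (non-negative integers only).
_STEP = re.compile(r"^\s*(\d+)\s*([+\-*×])\s*(\d+)\s*=\s*(\d+)\s*$")


def find_bad_steps(chain: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for every 'a op b = c' line whose arithmetic is wrong."""
    bad = []
    for number, line in enumerate(chain.splitlines(), start=1):
        match = _STEP.match(line)
        if match is None:
            continue  # not a simple arithmetic step; this sketch cannot check it
        left, op, right, claimed = match.groups()
        if _OPS[op](int(left), int(right)) != int(claimed):
            bad.append((number, line.strip()))
    return bad


chain = "17 × 20 = 340\n17 × 4 = 78\n340 + 78 = 418"
print(find_bad_steps(chain))  # [(2, '17 × 4 = 78')]
```

In the sample chain, the last line is internally consistent with the wrong value from line 2, which is exactly how a coherent chain ends in a confident but incorrect answer. Only the step check catches it.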
Questions about Chain-of-Thought.
CoT prompting is a prompt-engineering technique that works on any LLM. Reasoning models (o1/o3, Claude with extended thinking) are trained specifically for CoT and produce thousands of tokens of internal reasoning by default. They are roughly 10-100× more expensive per call and much slower (10-60 seconds per response is typical), but significantly more accurate on hard problems.
Use reasoning models for the hardest problems, where accuracy is paramount and you can tolerate the latency and cost (scientific reasoning, complex code generation, hard math, strategic analysis). Use CoT prompting on standard frontier models for tasks that benefit from explicit reasoning but don't justify a full reasoning model. Use direct prompting (no CoT) for routine tasks.
How much CoT improves accuracy is highly task-dependent. On math benchmarks (GSM8K, MATH), CoT can improve accuracy by 20-40 percentage points on smaller models and by less on frontier models. On reasoning benchmarks, improvements are typically 5-20 points. On simple tasks (classification, retrieval), CoT often doesn't help and can hurt. Always benchmark on your specific task.
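Benchmarking on your own task can be as simple as scoring each prompt variant against a small labeled set. The sketch below is one hedged way to do it: `ask` is a hypothetical callable wrapping your model client, the dataset is a list of (question, expected answer) pairs you supply, and answers are compared by exact match on the last line of the reply after an optional `Answer:` prefix, which is an assumed output convention.

```python
from typing import Callable, Dict, List, Tuple

PromptBuilder = Callable[[str], str]


def final_answer(reply: str) -> str:
    """Take the last line of the reply and drop an optional 'Answer:' style prefix."""
    lines = reply.strip().splitlines()
    last = lines[-1] if lines else ""
    return last.split(":", 1)[-1].strip()


def accuracy(ask: Callable[[str], str], dataset: List[Tuple[str, str]], build: PromptBuilder) -> float:
    """Fraction of questions whose final answer exactly matches the expected string."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(final_answer(ask(build(q))) == expected for q, expected in dataset)
    return correct / len(dataset)


def compare_prompts(
    ask: Callable[[str], str],
    dataset: List[Tuple[str, str]],
    variants: Dict[str, PromptBuilder],
) -> Dict[str, float]:
    """Score every named prompt variant on the same dataset."""
    return {name: accuracy(ask, dataset, build) for name, build in variants.items()}
```

A typical call passes two variants, for example `{"direct": lambda q: q, "cot": lambda q: q + "\nThink step by step."}`, and compares the two scores. Exact-match scoring is crude (it would mis-handle answers that contain a colon, such as times), so treat the numbers as a first signal rather than a final verdict.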
Showing the model's internal reasoning to end users is usually a bad idea: it is optimized for the model, not for human consumption. It is verbose, sometimes incoherent, and occasionally contains incorrect intermediate steps. Either hide it entirely (as most consumer products do) or summarize it into a user-friendly explanation, which helps in some debugging and transparency contexts.