What is Temperature in LLMs?
Temperature is a numerical parameter (typically 0.0 to 2.0, though some providers cap it at 1.0) that controls randomness in LLM output sampling. Lower values make the model more deterministic and conservative; higher values make it more creative and varied. At 0, identical inputs produce nearly identical outputs; at 1 and above, responses become genuinely diverse.
Overview
Temperature is one of the most commonly tuned sampling parameters in production LLM systems because it directly controls the trade-off between determinism and creativity. Mathematically, the logits (raw next-token scores) are divided by the temperature before the softmax that converts them to probabilities. At temperature 1 the model samples from its unmodified probability distribution. At lower temperatures the distribution sharpens toward the most likely tokens, and at higher temperatures it flattens, so unlikely tokens become more probable. Temperature 0 cannot be used as a divisor, so it is handled as a special case: the model always picks the highest-probability token (greedy decoding). In production we set temperature 0 for tasks where consistency matters (data extraction, classification, function calling) and temperature 0.7-1.0 for tasks where variety helps (brainstorming, creative writing, marketing copy).
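Written out, the scaling looks like the formula below. The notation is an assumption introduced here for illustration: $z_i$ is the logit of token $i$, $T$ is the temperature, and $p_i(T)$ is the resulting sampling probability.

```latex
p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad T > 0

\lim_{T \to 0^+} p_i(T) =
\begin{cases}
1 & \text{if } i = \arg\max_j z_j \\
0 & \text{otherwise}
\end{cases}
\qquad
\lim_{T \to \infty} p_i(T) = \frac{1}{N}
```

Here $N$ is the vocabulary size, and the first limit assumes a unique highest logit. Setting $T = 1$ recovers the ordinary softmax. The formula is undefined at $T = 0$, which is why temperature 0 is usually treated as greedy decoding rather than computed directly.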
How temperature changes output
At temperature 0, the model is essentially deterministic: for any given input, it picks the same most-probable next token at every step, producing nearly identical outputs (floating-point and serving-side effects can still cause occasional differences, but results are mostly stable). At temperature 0.5, the model usually picks high-probability tokens but occasionally samples something less expected. At temperature 1.0, the model samples from its unmodified probability distribution. At temperature 1.5+, low-probability tokens become much more likely, producing increasingly creative and eventually incoherent output. Providers cap temperature (OpenAI at 2.0, Anthropic at 1.0) because past a certain point the output becomes effectively random.
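The progression described above can be checked numerically. The following is a minimal pure-Python sketch, not any provider's implementation; the function name `softmax_with_temperature` and the four example logits are assumptions made up for illustration, and temperature 0 is handled as the greedy special case.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy limit: all probability mass goes to the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a four-token vocabulary.
logits = [4.0, 2.0, 1.0, 0.0]
for t in (0, 0.5, 1.0, 1.5, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's probability falling steadily as the temperature rises, while the tail tokens gain probability mass.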
When to use which temperature
- Temperature 0: data extraction, classification, function calling, code generation where correctness matters more than creativity, RAG answer generation where the model must stick to retrieved facts.
- Temperature 0.3-0.5: structured analysis with some flexibility, summarization, conservative chat assistants.
- Temperature 0.7-1.0: creative writing, brainstorming, marketing copy, conversational chatbots where variety prevents repetitiveness, or when you want the model to consider multiple valid framings.
- Temperature above 1.0 (rarely): only when genuine surprise is desirable (creative writing, lateral-thinking exercises) and you can tolerate occasional incoherence.
Temperature vs top_p (nucleus sampling)
Top_p is a different sampling parameter that limits the model to sampling from the smallest set of tokens whose cumulative probability reaches p (e.g., top_p=0.95 means sample only from the most likely tokens that together account for 95% of the probability). Temperature and top_p both control randomness, but in different ways: temperature rescales the entire distribution, while top_p truncates the long tail. Provider documentation commonly recommends tuning one or the other, not both. In practice, temperature is more intuitive, and many production systems adjust only temperature and leave top_p at its default.
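To make the truncation concrete, here is a minimal sketch of nucleus filtering over an already-computed probability list. The function name `top_p_filter` and the example probabilities are assumptions for illustration. Real implementations differ in details, for instance whether the cutoff uses "reaches" or "exceeds" p.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, zero out the rest, and renormalize."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    # Token indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


# Hypothetical next-token probabilities.
probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.9))  # the least likely token is dropped
```

Unlike temperature, this leaves the relative odds among the surviving tokens unchanged. It only removes the tail and rescales what remains.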
Use cases
- Setting temperature 0 for structured data extraction where consistency matters
- Setting temperature 0 for RAG answers to keep responses grounded in retrieved sources
- Setting temperature 0.7+ for creative writing, marketing copy, or brainstorming
- Running multiple inferences at temperature > 0 to gather diverse candidate answers
- Self-consistency techniques (sample N answers at temperature > 0, take the majority answer)
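The last two use cases can be sketched as a simple majority vote. This is a minimal illustration under stated assumptions, not a reference implementation: `self_consistent_answer` is a hypothetical helper, and `fake_model_call` is a stand-in for a real LLM call made at temperature > 0 that returns only the final answer string.

```python
import random
from collections import Counter


def self_consistent_answer(sample_answer, n_samples=5):
    """Call sample_answer n_samples times and return the most common
    answer together with its share of the votes."""
    answers = [sample_answer() for _ in range(n_samples)]
    # On a tie, Counter.most_common returns the answer seen first.
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Hypothetical stand-in for an LLM sampled at temperature > 0.
rng = random.Random(0)


def fake_model_call():
    return rng.choices(["42", "41", "24"], weights=[0.6, 0.25, 0.15])[0]


answer, share = self_consistent_answer(fake_model_call, n_samples=25)
print(answer, share)
```

The vote share is a rough confidence signal: a low share suggests the samples disagree and the answer may deserve a second look.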
Examples in production
OpenAI Cookbook
Recommends temperature 0 for code generation, function calling, and data extraction tasks; temperature 0.7-1.0 for creative work.
Anthropic
Claude API documentation recommends temperature closer to 0.0 for analytical and multiple-choice tasks and closer to 1.0 for creative and generative tasks.
Self-consistency research
Wang et al. (2022) showed that sampling multiple chain-of-thought reasoning paths at temperature > 0 and taking the majority answer significantly improves accuracy on math and reasoning benchmarks.
Temperature compared to alternatives
| Alternative | Choose Temperature when | Choose alternative when |
|---|---|---|
| Top_p (nucleus sampling): limits sampling to the most-probable tokens that cover p% of the probability mass | Use temperature for global control of randomness: more intuitive and standard | Use top_p when you want to bound the worst-case randomness without scaling the whole distribution |
| Greedy decoding: always pick the highest-probability next token (equivalent to temperature 0) | Use temperature > 0 when you want occasional variation | Use greedy decoding (temperature 0) for maximum reproducibility |
Common pitfalls
- Using high temperature for factual tasks: you'll get more creative, less correct answers
- Using temperature 0 for creative tasks: outputs become repetitive and stale
- Assuming temperature 0 = perfectly deterministic: small floating-point variations can still cause occasional differences
- Tuning both temperature and top_p simultaneously: pick one, leave the other at default
- Not testing your application across multiple temperature settings before locking it in
Questions about Temperature.
Temperature 0 is almost, but not perfectly, deterministic. Floating-point arithmetic on GPUs has tiny non-determinism, and serving-side factors such as request batching can add further variation between calls. For exact reproducibility you usually also need to fix the random seed (some APIs support a seed parameter) and use the same model version.
Very high temperatures produce highly creative or incoherent output: the model is sampling from an extremely flat distribution where unlikely tokens are nearly as likely as common ones. This is useful for creative experimentation but rarely appropriate in production, where systems seldom go above 1.0.
Temperature doesn't change per-token cost. It only affects which tokens are sampled, not how many. However, higher temperatures sometimes produce longer, more verbose outputs, which indirectly increases cost. If cost matters, set max output tokens explicitly.