
Temperature is a parameter that controls the randomness and creativity of a Large Language Model's output. It typically ranges from 0.0 to 2.0, where 0.0 produces completely deterministic output (always selecting the highest-probability token) and higher values introduce increasing randomness by flattening the probability distribution across possible next tokens. Temperature is one of the most impactful inference parameters that users can adjust without changing the prompt itself — it directly shapes whether output is predictable and focused or diverse and exploratory. Every major LLM API exposes temperature as a core parameter alongside top-p sampling and max tokens.
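As a concrete illustration, here is what a request to a typical chat-completions-style API might look like with temperature set alongside the other sampling parameters mentioned above. This is a hedged sketch: the model name is a placeholder and exact field names vary by provider.

```python
# Illustrative request payload; "example-model" is a placeholder,
# and field names follow the common chat-completions convention.
request = {
    "model": "example-model",
    "messages": [
        {"role": "user", "content": "Write a tagline for a coffee shop."}
    ],
    "temperature": 0.9,  # favor diverse, creative phrasing
    "top_p": 1.0,        # nucleus-sampling cutoff; many providers suggest tuning this OR temperature, not both
    "max_tokens": 64,    # cap on response length
}
```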
Why it matters
Temperature is the primary lever for balancing consistency versus creativity in AI applications. For tasks requiring reliability — structured data extraction, classification, code generation, factual Q&A — a temperature near 0.0 makes output as reproducible as possible: the same input yields the same (or nearly the same) output on every run. For creative tasks — brainstorming, marketing copy, story writing — a temperature of 0.7-1.0 produces more diverse and surprising outputs. Choosing the wrong temperature can undermine an entire application: a customer support bot at temperature 1.0 gives inconsistent answers that erode trust, while a creative writing assistant at temperature 0.0 produces bland, repetitive text. For production systems, temperature is often the first parameter tuned after the prompt itself, and getting it right can improve perceived quality more than additional prompt tweaks.
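One common pattern is to encode this task-to-temperature mapping as presets rather than hard-coding a single value. The sketch below is purely illustrative — the task names, values, and helper function are assumptions, not a standard:

```python
# Hypothetical per-task temperature presets; the task names and
# values are illustrative starting points, not prescriptions.
TASK_TEMPERATURES = {
    "data_extraction": 0.0,   # deterministic: same input, same output
    "classification": 0.0,
    "code_generation": 0.2,
    "factual_qa": 0.2,
    "marketing_copy": 0.8,    # diverse phrasing for copywriters
    "brainstorming": 1.0,     # maximize variety of ideas
}

def temperature_for(task: str, default: float = 0.7) -> float:
    """Look up a preset, falling back to a middle-of-the-road default."""
    return TASK_TEMPERATURES.get(task, default)
```

Centralizing the presets also makes a settings swap like the one described in the Example section below easy to catch in code review.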
How it works
Temperature modifies the probability distribution over the model's vocabulary before the next token is selected. During inference, the model produces raw scores (logits) for every possible next token. These logits are divided by the temperature value before passing through the softmax function, which converts them to probabilities. A low temperature (e.g., 0.1) divides by a small number, amplifying differences between logits — the highest-scoring token dominates with near-100% probability. A high temperature (e.g., 1.5) divides by a larger number, compressing differences — lower-probability tokens get a meaningful chance of being selected. Temperature 0.0 is a special case: rather than dividing by zero, implementations fall back to greedy decoding, always picking the single highest-probability token. Temperature interacts with other sampling parameters like top-p (nucleus sampling) and top-k, which further constrain which tokens are eligible for selection.
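The scaling described above can be sketched in a few lines of plain Python. The logit values are made up for illustration; real vocabularies have tens of thousands of entries, but the math is the same:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply softmax to get probabilities."""
    if temperature <= 0:
        # Special case: greedy decoding puts all mass on the argmax token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Raw logits for four hypothetical candidate tokens
logits = [2.0, 1.0, 0.5, 0.1]

low = softmax_with_temperature(logits, 0.1)   # sharply peaked: top token dominates
high = softmax_with_temperature(logits, 1.5)  # flatter: runners-up stay in play
greedy = softmax_with_temperature(logits, 0.0)  # all mass on the top token
```

Running this, the low-temperature distribution puts essentially all probability on the first token, while at temperature 1.5 the same token keeps less than half the mass — exactly the sharpening/flattening effect described above.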
Example
A marketing agency uses the same LLM for two workflows. Their SEO metadata generator runs at temperature 0.1 — producing consistent, keyword-optimized titles and descriptions that pass automated quality checks without manual review. Their creative campaign brainstorming tool runs at temperature 0.9 — generating diverse tagline variations, unexpected angles, and novel metaphors that copywriters use as raw material. When they accidentally swapped the settings, the SEO tool started producing inconsistent metadata that broke their automated pipeline, while the brainstorming tool returned safe, generic suggestions that the creative team found useless. Restoring the correct temperature settings immediately fixed both workflows, demonstrating how this single parameter determines whether an LLM application succeeds or fails at its specific purpose.