
Inference is the process of running a trained Large Language Model to generate output from a given input. Every time you send a prompt to an LLM and receive a response, that is an inference operation. Inference is fundamentally distinct from training: training creates the model by learning from data (a one-time, expensive process), while inference uses the finished model to make predictions (a per-request, relatively inexpensive process). For most organizations working with LLMs, inference is the only phase they interact with — either through API calls to hosted models or by running open-source models on their own infrastructure. Discussions of LLM cost, latency, and throughput all center on inference performance.
Why it matters
Inference is where all LLM economics play out. When a company uses an LLM API, every request is an inference operation with a measurable cost in tokens processed, time taken, and compute consumed. Understanding inference economics — input vs. output token pricing, latency requirements (time to first token, time to complete), and throughput limits (requests per minute) — is essential for planning any AI deployment. The distinction between training and inference also clarifies important limitations: you cannot teach an LLM new facts through inference alone (prompting does not update model weights), and inference quality is bounded by what the model learned during training. This understanding drives architectural decisions like whether to fine-tune a model, implement RAG for knowledge updates, or simply improve prompts.
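The per-request economics described above can be sketched in a few lines. This is a minimal cost estimator, assuming hypothetical per-million-token prices (the `request_cost` helper and the specific rates are illustrative, not any provider's actual pricing):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one inference call, with input and output
    tokens priced separately (output tokens typically cost more)."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical rates: $3 per million input tokens, $15 per million output tokens
cost = request_cost(1_200, 300, input_price_per_m=3.0, output_price_per_m=15.0)
print(round(cost, 4))  # 0.0081
```

Separating the two rates matters in practice: a prompt-heavy workload (long context, short answers) and a generation-heavy workload (short prompt, long answers) with the same total token count can have very different costs.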
How it works
During inference, the LLM processes the input tokens through its neural network layers to generate output tokens one at a time. The input passes through the model's attention layers and feed-forward networks, producing a probability distribution over the entire vocabulary for the next token. The system selects a token from this distribution (influenced by temperature and other parameters), appends it to the sequence, and repeats. This autoregressive loop continues until the model produces a stop token or reaches the maximum output length. Inference speed is measured in tokens per second and is affected by model size, hardware (GPUs/TPUs), batch processing, and optimizations like KV caching (reusing attention computations from previous tokens) and quantization (reducing numerical precision for faster computation).
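The autoregressive loop above can be sketched with a toy model. Here `model` stands in for the full forward pass (attention plus feed-forward layers) and simply returns a list of logits over the vocabulary; the temperature-scaled softmax sampling is the same mechanism real inference engines use:

```python
import math
import random

def sample_next(logits: list[float], temperature: float = 1.0) -> int:
    """Pick the next token id from a logit vector.
    Lower temperature sharpens the distribution toward the top token."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()                           # sample from the distribution
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1

def generate(model, prompt_ids: list[int], max_new_tokens: int,
             stop_id: int, temperature: float = 1.0) -> list[int]:
    """Autoregressive decoding: predict one token, append it, repeat."""
    seq = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(seq)                       # forward pass -> vocab scores
        tok = sample_next(logits, temperature)
        if tok == stop_id:                        # model signals completion
            break
        seq.append(tok)                           # feed the token back in
    return seq
```

In a real engine the expensive part is the `model(seq)` call; KV caching avoids recomputing attention for the tokens already in `seq`, which is why per-token latency stays roughly flat as the sequence grows.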
Example
A media company runs a content management pipeline processing 5,000 articles per day. Each article requires three inference calls: one for summarization (average 800 input + 200 output tokens), one for categorization (400 input + 50 output tokens), and one for SEO metadata generation (600 input + 150 output tokens). That totals 11 million tokens per day in inference. By profiling their inference pipeline, they discover that the categorization task (which uses only 50 output tokens) runs efficiently on a smaller, cheaper model, while summarization benefits from a frontier model's quality. Splitting inference across two model tiers — cheap model for classification, premium model for generation — reduces daily costs by 45% while maintaining quality where it matters most.
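The pipeline arithmetic above checks out as follows. Only the token counts come from the example; the structure of the `TASKS` table is an illustrative sketch:

```python
ARTICLES_PER_DAY = 5_000
TASKS = {
    "summarization":  {"in": 800, "out": 200},
    "categorization": {"in": 400, "out": 50},
    "seo_metadata":   {"in": 600, "out": 150},
}

tokens_per_article = sum(t["in"] + t["out"] for t in TASKS.values())
daily_tokens = ARTICLES_PER_DAY * tokens_per_article
print(tokens_per_article)  # 2200
print(daily_tokens)        # 11000000 -- the 11 million tokens/day above
```

Profiling at this level of granularity is what makes the two-tier split possible: once token volume is broken down per task, routing the short-output categorization call to a cheaper model is a straightforward change.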