Models & Architecture
Intermediate

What Is Model Distillation?

Training a smaller 'student' model to replicate a larger 'teacher' model's capabilities at a fraction of the cost and latency

Also known as:
Knowledge Distillation
Kennisdistillatie (Dutch)
Teacher-Student Training
What Is Model Distillation? How Knowledge Transfer Makes AI Smaller & Faster

Model distillation (also called knowledge distillation) is a training technique where a smaller "student" model is trained to replicate the behavior and capabilities of a larger "teacher" model by learning from the teacher's output distributions rather than from raw training data alone. Instead of training the student directly on labeled data, distillation uses the teacher model to generate "soft targets" — probability distributions over all possible outputs — that encode richer information than simple correct/incorrect labels. The student learns not just what the right answer is, but how confident the teacher is and which alternative answers are plausible. This approach typically produces a student model that retains 90-95% of the teacher's quality at 10-20% of the size, enabling dramatic reductions in inference cost and latency.
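The "soft targets" idea can be illustrated with a temperature-scaled softmax: raising the temperature flattens the teacher's output distribution and exposes how it ranks the alternative answers. A minimal numpy sketch; the logits for answers A, B, and C are made up for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Dividing logits by T > 1 flattens the distribution, exposing the
    # teacher's ranking of "wrong" answers (the so-called dark knowledge).
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [6.0, 4.5, -2.0]     # hypothetical scores for answers A, B, C

hard = softmax_with_temperature(teacher_logits, T=1.0)  # near one-hot on A
soft = softmax_with_temperature(teacher_logits, T=4.0)  # B and C gain mass
```

At T=1 the teacher looks almost certain about A; at T=4 the distribution still ranks A > B > C but now shows that B is a plausible alternative while C is not — exactly the relational signal the student trains on.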

Why it matters

Model distillation is the key technique for making frontier AI capabilities economically viable at production scale. Running a 200-billion-parameter frontier model costs 10-30× more per request than a 7-13-billion-parameter model. For high-volume applications — customer support, document processing, content moderation — this cost difference makes frontier models financially impractical even when they deliver the best quality. Distillation bridges the gap: you use the frontier model as a teacher to train a smaller model that reaches near-frontier quality on your specific domain at a fraction of the ongoing cost. The economics are compelling — distillation training is a one-time investment, after which every request for the life of the application runs at the smaller model's lower cost. Companies routinely report 70-80% reductions in per-request inference cost with less than 5% degradation in task performance.

How it works

Distillation proceeds in stages. First, the teacher model processes a large set of inputs and produces softened probability distributions (using an elevated temperature parameter that reveals the teacher's uncertainty patterns). Second, the student model is trained to match these soft distributions rather than just the final answers — this is the key insight that distinguishes distillation from simple fine-tuning. The soft targets carry information about relationships between possible answers: when a teacher assigns 60% probability to response A, 25% to response B, and only 1% to response C, the student learns that A and B are related valid responses while C is definitively wrong. This nuanced signal enables the student to generalize better than if it had only seen binary correct/incorrect labels. Modern LLM distillation often combines this approach with supervised fine-tuning on task-specific data and reinforcement learning from AI feedback (RLAIF), producing compact models that punch well above their weight class for specific domains.
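The matching objective described above is commonly implemented as a weighted sum of a softened KL-divergence term (student vs. teacher distributions) and a standard cross-entropy term against the ground-truth label, following Hinton et al. A minimal numpy sketch; the logits and the `alpha`/`T` hyperparameters are illustrative assumptions, not values from the article:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.7):
    # Soft term: KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 (per Hinton et al.) so its gradient magnitude stays
    # comparable to the hard term as T grows.
    p_t = softmax(np.asarray(teacher_logits, dtype=float) / T)
    p_s = softmax(np.asarray(student_logits, dtype=float) / T)
    soft = float((p_t * (np.log(p_t) - np.log(p_s))).sum()) * T * T
    # Hard term: ordinary cross-entropy against the ground-truth label.
    hard = -float(np.log(softmax(student_logits)[hard_label]))
    return alpha * soft + (1 - alpha) * hard

teacher = [6.0, 4.5, -2.0]                      # hypothetical logits for A, B, C
aligned   = distillation_loss(teacher, teacher, hard_label=0)   # student matches teacher
misranked = distillation_loss([-2.0, 4.5, 6.0], teacher, hard_label=0)
```

A student that reproduces the teacher's distribution incurs only the small hard-label term; a student that ranks the answers wrongly is penalized by both terms, which is what pushes it toward the teacher's relational knowledge rather than just the top answer.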

Example

A logistics company processes 50,000 shipment-related customer inquiries per day using a frontier API model at €0.015 per request — €750 daily, €22,500 per 30-day month. They distill a 7B-parameter student by having the frontier model process 200,000 historical inquiries, capturing its responses together with soft probability distributions. The student trains for 3 days on 8 GPUs (one-time cost: approximately €2,000). After distillation, the student handles 94% of inquiries at the same quality as the teacher while running on 2 GPUs at €0.002 per request; a routing layer sends the remaining 6% of complex edge cases to the frontier model. New monthly cost: roughly €2,820 for the distilled model plus €1,350 for routed frontier requests — about €4,170 versus the previous €22,500. At roughly €610 of savings per day, the €2,000 distillation investment pays for itself in under a week.
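The cost arithmetic can be checked in a few lines, assuming 30 billing days per month; all figures are the hypothetical ones from the scenario:

```python
DAYS = 30                                  # assumed billing days per month
daily_inquiries = 50_000

# Before: every request goes to the frontier API.
frontier_price = 0.015                     # EUR per request
before_monthly = daily_inquiries * frontier_price * DAYS

# After: the distilled 7B student handles 94% of traffic;
# a routing layer sends the remaining 6% to the frontier model.
student_price = 0.002                      # EUR per request on own GPUs
student_share = 0.94
after_monthly = (daily_inquiries * student_share * student_price
                 + daily_inquiries * (1 - student_share) * frontier_price) * DAYS

# Payback period for the one-time distillation investment.
training_cost = 2_000                      # EUR, one-time
payback_days = training_cost / ((before_monthly - after_monthly) / DAYS)
```

The monthly bill drops from €22,500 to about €4,170, so the daily saving of roughly €610 recoups the €2,000 training cost within the first week.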

Sources

  1. Hinton et al. — Distilling the Knowledge in a Neural Network (arXiv)
  2. Sanh et al. — DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (arXiv)
  3. Knowledge distillation — Wikipedia

Related Concepts

Fine-Tuning
Training a pre-trained LLM further on domain-specific data to specialize its behavior
LoRA (Low-Rank Adaptation)
An efficient fine-tuning method that trains only small adapter layers instead of the full model
Quantization
Reducing model weight precision from 16/32-bit to 8/4-bit to shrink size and speed up inference
RAG (Retrieval-Augmented Generation)
A technique that combines LLMs with external knowledge retrieval to improve accuracy and reduce hallucinations

