Models & Architecture
Intermediate

What Is Model Distillation?

Training a smaller 'student' model to replicate a larger 'teacher' model's capabilities at a fraction of the cost and latency

Also known as:
Knowledge Distillation
Kennisdistillatie (Dutch)
Teacher-Student Training
What Is Model Distillation? How Knowledge Transfer Makes AI Smaller & Faster

Model distillation (also called knowledge distillation) is a training technique where a smaller "student" model is trained to replicate the behavior and capabilities of a larger "teacher" model by learning from the teacher's output distributions rather than from raw training data alone. Instead of training the student directly on labeled data, distillation uses the teacher model to generate "soft targets" — probability distributions over all possible outputs — that encode richer information than simple correct/incorrect labels. The student learns not just what the right answer is, but how confident the teacher is and which alternative answers are plausible. This approach typically produces a student model that retains 90-95% of the teacher's quality at 10-20% of the size, enabling dramatic reductions in inference cost and latency.
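The "soft targets" idea can be illustrated with a temperature-scaled softmax: raising the temperature flattens the teacher's output distribution and exposes how it ranks the alternative answers. A minimal numpy sketch; the logits for answers A, B, and C are made up for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Dividing logits by T > 1 flattens the distribution, exposing the
    # teacher's ranking of "wrong" answers (the so-called dark knowledge).
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [6.0, 4.5, -2.0]     # hypothetical scores for answers A, B, C

hard = softmax_with_temperature(teacher_logits, T=1.0)  # near one-hot on A
soft = softmax_with_temperature(teacher_logits, T=4.0)  # B and C gain mass
```

At T=1 the teacher looks almost certain about A; at T=4 the distribution still ranks A > B > C but now shows that B is a plausible alternative while C is not — exactly the relational signal the student trains on.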

Why it matters

Model distillation is the key technique for making frontier AI capabilities economically viable at production scale. Running a 200-billion-parameter frontier model costs 10-30× more per request than a 7-13-billion-parameter model. For high-volume applications — customer support, document processing, content moderation — this cost difference makes frontier models financially impractical even when they deliver the best quality. Distillation bridges the gap: you use the frontier model as a teacher to train a smaller model that reaches near-frontier quality on your specific domain at a fraction of the ongoing cost. The economics are compelling — distillation training is a one-time investment, after which every request for the life of the application runs at the smaller model's lower cost. Companies routinely report 70-80% reductions in per-request inference cost with less than 5% degradation in task performance.

How it works

Distillation proceeds in stages. First, the teacher model processes a large set of inputs and produces softened probability distributions (using an elevated temperature parameter that reveals the teacher's uncertainty patterns). Second, the student model is trained to match these soft distributions rather than just the final answers — this is the key insight that distinguishes distillation from simple fine-tuning. The soft targets carry information about relationships between possible answers: when a teacher assigns 60% probability to response A, 25% to response B, and only 1% to response C, the student learns that A and B are related valid responses while C is definitively wrong. This nuanced signal enables the student to generalize better than if it had only seen binary correct/incorrect labels. Modern LLM distillation often combines this approach with supervised fine-tuning on task-specific data and reinforcement learning from AI feedback (RLAIF), producing compact models that punch well above their weight class for specific domains.
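The matching objective described above is commonly implemented as a weighted sum of a softened KL-divergence term (student vs. teacher distributions) and a standard cross-entropy term against the ground-truth label, following Hinton et al. A minimal numpy sketch; the logits and the `alpha`/`T` hyperparameters are illustrative assumptions, not values from the article:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.7):
    # Soft term: KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 (per Hinton et al.) so its gradient magnitude stays
    # comparable to the hard term as T grows.
    p_t = softmax(np.asarray(teacher_logits, dtype=float) / T)
    p_s = softmax(np.asarray(student_logits, dtype=float) / T)
    soft = float((p_t * (np.log(p_t) - np.log(p_s))).sum()) * T * T
    # Hard term: ordinary cross-entropy against the ground-truth label.
    hard = -float(np.log(softmax(student_logits)[hard_label]))
    return alpha * soft + (1 - alpha) * hard

teacher = [6.0, 4.5, -2.0]                      # hypothetical logits for A, B, C
aligned   = distillation_loss(teacher, teacher, hard_label=0)   # student matches teacher
misranked = distillation_loss([-2.0, 4.5, 6.0], teacher, hard_label=0)
```

A student that reproduces the teacher's distribution incurs only the small hard-label term; a student that ranks the answers wrongly is penalized by both terms, which is what pushes it toward the teacher's relational knowledge rather than just the top answer.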

Example

A logistics company processes 50,000 shipment-related customer inquiries per day using a frontier API model at €0.015 per request — €750 daily, €22,500 per 30-day month. They distill a 7B-parameter student by having the frontier model process 200,000 historical inquiries, capturing its responses together with soft probability distributions. The student trains for 3 days on 8 GPUs (one-time cost: approximately €2,000). After distillation, the student handles 94% of inquiries at the same quality as the teacher while running on 2 GPUs at €0.002 per request; a routing layer sends the remaining 6% of complex edge cases to the frontier model. New monthly cost: roughly €2,820 for the distilled model plus €1,350 for routed frontier requests — about €4,170 versus the previous €22,500. At roughly €610 of savings per day, the €2,000 distillation investment pays for itself in under a week.
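The cost arithmetic can be checked in a few lines, assuming 30 billing days per month; all figures are the hypothetical ones from the scenario:

```python
DAYS = 30                                  # assumed billing days per month
daily_inquiries = 50_000

# Before: every request goes to the frontier API.
frontier_price = 0.015                     # EUR per request
before_monthly = daily_inquiries * frontier_price * DAYS

# After: the distilled 7B student handles 94% of traffic;
# a routing layer sends the remaining 6% to the frontier model.
student_price = 0.002                      # EUR per request on own GPUs
student_share = 0.94
after_monthly = (daily_inquiries * student_share * student_price
                 + daily_inquiries * (1 - student_share) * frontier_price) * DAYS

# Payback period for the one-time distillation investment.
training_cost = 2_000                      # EUR, one-time
payback_days = training_cost / ((before_monthly - after_monthly) / DAYS)
```

The monthly bill drops from €22,500 to about €4,170, so the daily saving of roughly €610 recoups the €2,000 training cost within the first week.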

Sources

  1. Hinton et al. — Distilling the Knowledge in a Neural Network (arXiv)
  2. Sanh et al. — DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (arXiv)
  3. Knowledge distillation — Wikipedia

Related Concepts

Fine-Tuning
Training a pre-trained LLM further on domain-specific data to specialize its behavior
LoRA (Low-Rank Adaptation)
An efficient fine-tuning method that trains only small adapter layers instead of the full model
Quantization
Reducing model weight precision from 16/32-bit to 8/4-bit to shrink size and speed up inference
RAG (Retrieval-Augmented Generation)
A technique that combines LLMs with external knowledge retrieval to improve accuracy and reduce hallucinations

