Models & Architecture
Intermediate

What Is Quantization?

Reducing model weight precision from 16/32-bit to 8/4-bit to shrink size and speed up inference

Also known as:
Model Quantization
INT8
INT4
GPTQ
GGUF

Quantization is the technique of reducing the numerical precision of a model's weights and activations — typically from 16-bit or 32-bit floating point to 8-bit or 4-bit integers — to shrink model size, reduce memory requirements, and speed up inference. A 70B-parameter model stored in 16-bit precision requires approximately 140GB of GPU memory; quantized to 4-bit, it fits in roughly 35GB, making it runnable on consumer hardware that previously could not hold it. Quantization trades a small amount of model quality for dramatic gains in efficiency, and modern quantization methods (GPTQ, AWQ, GGUF) achieve this with minimal quality loss — often less than 2% degradation on standard benchmarks.
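
The footprint figures follow directly from bytes per parameter. A quick back-of-the-envelope sketch (it ignores quantization scale metadata and the KV cache, both of which add overhead on top of the raw weights):

```python
def weight_footprint_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB: parameters x bits per weight, converted to bytes."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"70B parameters at {bits:>2}-bit: ~{weight_footprint_gb(70e9, bits):.0f} GB")
# 32-bit: ~280 GB, 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB
```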

Why it matters

Quantization is the primary technology that enables running large language models on affordable hardware. Without quantization, deploying a 70B model requires multiple enterprise-grade GPUs costing tens of thousands of dollars. With 4-bit quantization, the same model runs on a single GPU costing a fraction of that price. This has profound implications: it enables local LLM deployment for data privacy (no API calls leaving the organization), reduces inference costs for hosted deployments by 2-4×, and makes it feasible for researchers and small companies to experiment with large models. The combination of LoRA for efficient training and quantization for efficient inference has created a practical pathway for organizations to fine-tune and deploy custom LLMs on modest hardware budgets.

How it works

Quantization maps the continuous range of floating-point weight values into a smaller set of discrete integer values. In 4-bit quantization, each weight is represented by one of only 16 possible values (compared to 65,536 for 16-bit). The quantization process determines the optimal mapping that minimizes the difference between original and quantized outputs. Post-training quantization (PTQ) converts a trained model without additional training — methods like GPTQ analyze calibration data to find optimal quantization parameters per layer. Quantization-aware training (QAT) incorporates quantization during the training process itself, allowing the model to compensate for precision loss. Mixed precision keeps critical layers (typically the first and last layers) in higher precision while quantizing the bulk of the model. Inference speed improves because integer operations are faster than floating-point operations, and smaller weights require less memory bandwidth — often the true bottleneck in LLM inference.
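
To make the mapping concrete, here is a minimal sketch of symmetric (absmax) round-to-nearest quantization with a single scale per tensor — the simplest form of post-training quantization. Per-group scales, GPTQ-style error correction, activation quantization, and bit-packing (real 4-bit kernels store two weights per byte) are all omitted, and the helper names are illustrative rather than taken from any library:

```python
import numpy as np

def quantize_absmax(weights: np.ndarray, bits: int = 8):
    """Map float weights to integers in [-qmax, qmax] using one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1             # 127 for int8, 7 for int4
    scale = np.abs(weights).max() / qmax   # the largest |weight| lands on qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32) * 0.02    # toy weight vector
for bits in (8, 4):
    q, scale = quantize_absmax(w, bits)
    err = np.abs(w - dequantize(q, scale)).mean()
    print(f"int{bits}: mean absolute rounding error {err:.6f}")
```

With only 15 usable levels in the 4-bit case, a single per-tensor scale loses noticeable precision, which is why practical 4-bit schemes use small per-group scales (AWQ, GGUF) or error-compensating weight updates (GPTQ) to keep quality loss low.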

Example

A healthcare startup needs to run a medical LLM on-premises for regulatory compliance — patient data cannot leave their network. The best open-source medical model has 70B parameters and requires 140GB in 16-bit — far beyond their two A100 40GB GPUs. Quantizing to 4-bit with GPTQ reduces the memory footprint to roughly 35GB, fitting on a single GPU with room left for the KV cache. They benchmark the quantized model against the full-precision version on their medical Q&A test suite: accuracy drops from 89.2% to 87.8% — a 1.4 percentage point reduction that their medical review board deems acceptable given the enormous cost savings. The quantized model also processes tokens 40% faster thanks to reduced memory bandwidth requirements, improving response time for the clinical decision support interface. Total hardware cost: a small fraction of what full-precision deployment would require.
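
In practice, a deployment like this usually just loads an already-quantized checkpoint. A minimal sketch of 4-bit loading with Hugging Face transformers and bitsandbytes (NF4 quantization rather than GPTQ; the model ID and prompt are placeholders):

```python
# Assumes: pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/medical-llm-70b"       # placeholder model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # spread layers across available GPUs
)

prompt = "Summarize the contraindications for drug X in renal impairment:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```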

Related Concepts

AI Inference
The process of running a trained LLM to generate output from input
LoRA (Low-Rank Adaptation)
An efficient fine-tuning method that trains only small adapter layers instead of the full model
Token Economics
The pricing and cost structure of LLM usage based on token consumption
Model Distillation
Training a smaller 'student' model to replicate a larger 'teacher' model's capabilities at a fraction of the cost and latency
Scaling Laws for LLMs
Empirical patterns showing that LLM capabilities improve predictably as model size, training data, and compute increase — enabling reliable planning of AI investments
