
Quantization is the technique of reducing the numerical precision of a model's weights and activations — typically from 16-bit or 32-bit floating point to 8-bit or 4-bit integers — to shrink model size, reduce memory requirements, and accelerate inference. A 70B-parameter model stored in 16-bit precision requires approximately 140GB of GPU memory; quantized to 4-bit, it fits in roughly 35GB, making it runnable on consumer hardware that previously could not hold it. Quantization trades a small amount of model quality for dramatic improvements in efficiency, and modern quantization methods (GPTQ, AWQ, GGUF) achieve this with minimal quality loss — often less than 2 percentage points of degradation on standard benchmarks.
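The memory figures above follow directly from parameter count times bits per weight. A quick back-of-envelope helper (illustrative only — it counts weight storage and ignores the KV cache, activations, and quantization metadata such as per-group scales):

```python
def model_memory_gb(n_params: float, bits_per_weight: int) -> float:
    # Weight storage only: params * bits -> bytes -> gigabytes.
    # Ignores KV cache, activations, and quantization metadata.
    return n_params * bits_per_weight / 8 / 1e9

fp16_gb = model_memory_gb(70e9, 16)  # 140.0 GB
int4_gb = model_memory_gb(70e9, 4)   # 35.0 GB
```

Real deployments need headroom beyond the weights themselves, so treat these numbers as a lower bound on required GPU memory.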
Why it matters
Quantization is the primary technology that enables running large language models on affordable hardware. Without quantization, deploying a 70B model requires multiple enterprise-grade GPUs costing tens of thousands of dollars. With 4-bit quantization, the same model runs on a single GPU costing a fraction of that price. This has profound implications: it enables local LLM deployment for data privacy (no API calls leaving the organization), reduces inference costs for hosted deployments by 2-4×, and makes it feasible for researchers and small companies to experiment with large models. The combination of LoRA for efficient training and quantization for efficient inference has created a practical pathway for organizations to fine-tune and deploy custom LLMs on modest hardware budgets.
How it works
Quantization maps the continuous range of floating-point weight values into a smaller set of discrete integer values. In 4-bit quantization, each weight is represented by one of only 16 possible values (compared to 65,536 for 16-bit). In practice the mapping is usually affine: each layer or group of weights shares a scale factor (and often a zero-point), chosen to minimize the difference between the original and quantized outputs. Post-training quantization (PTQ) converts a trained model without additional training — methods like GPTQ analyze calibration data to find optimal quantization parameters per layer. Quantization-aware training (QAT) incorporates quantization during the training process itself, allowing the model to compensate for precision loss. Mixed precision keeps critical layers (typically the first and last layers) in higher precision while quantizing the bulk of the model. Inference speed improves because integer operations are faster than floating-point operations, and smaller weights require less memory bandwidth — often the true bottleneck in LLM inference.
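The affine mapping can be sketched in a few lines of NumPy. This is a minimal round-to-nearest illustration with a per-group scale and zero-point, not any production method — GPTQ and AWQ additionally use calibration data and solve per layer to minimize *output* error, not just weight error. The function names here are made up for the example:

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, group_size: int = 64):
    """Affine 4-bit quantization: one scale/zero-point per group of weights.

    Illustrative sketch only -- real methods (GPTQ, AWQ) use calibration
    data to minimize output error rather than raw weight error.
    """
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15            # 4 bits -> 16 levels, codes 0..15
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero_point), 0, 15).astype(np.uint8)
    return q, scale, zero_point

def dequantize_4bit(q, scale, zero_point, shape):
    # Invert the affine map; the result only approximates the original.
    return ((q.astype(np.float64) - zero_point) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128))
q, s, z = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, z, w.shape)
err = np.abs(w - w_hat).max()  # bounded by half a quantization step
```

Note the error bound: with round-to-nearest, no reconstructed weight deviates from the original by more than half a scale step for its group, which is why coarser groups (larger `group_size`) trade accuracy for less metadata overhead.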
Example
A healthcare startup needs to run a medical LLM on-premises for regulatory compliance — patient data cannot leave their network. The best open-source medical model has 70B parameters and requires 140GB in 16-bit — far beyond their two A100 40GB GPUs. Quantizing to 4-bit with GPTQ reduces the memory footprint to 35GB, fitting comfortably on a single GPU with room for the KV cache. They benchmark the quantized model against the full-precision version on their medical Q&A test suite: accuracy drops from 89.2% to 87.8% — a 1.4-percentage-point reduction that their medical review board deems acceptable given the enormous cost savings. The quantized model also processes tokens 40% faster due to reduced memory bandwidth requirements, improving response time for the clinical decision support interface. Total hardware cost: two orders of magnitude less than what full-precision deployment would require.
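The startup's go/no-go decision can be formalized as two gates: does the model fit in GPU memory, and does accuracy stay within a regression budget? A sketch of that check, where the `deployment_check` helper and the 4GB KV-cache allowance are illustrative assumptions rather than figures from the scenario:

```python
def deployment_check(n_params: float, bits: int, gpu_mem_gb: float,
                     kv_cache_gb: float, baseline_acc: float,
                     quant_acc: float, max_drop_pp: float) -> bool:
    """Gate a quantized deployment on memory fit and an accuracy budget.

    Illustrative helper; kv_cache_gb is an assumed reservation, and real
    acceptance criteria would come from domain review, not a threshold.
    """
    weights_gb = n_params * bits / 8 / 1e9
    fits = weights_gb + kv_cache_gb <= gpu_mem_gb
    drop_pp = (baseline_acc - quant_acc) * 100.0  # percentage points
    return fits and drop_pp <= max_drop_pp

# 4-bit on one A100 40GB: 35GB weights + assumed 4GB KV cache, 1.4pp drop
ok = deployment_check(70e9, 4, gpu_mem_gb=40, kv_cache_gb=4,
                      baseline_acc=0.892, quant_acc=0.878, max_drop_pp=2.0)
```

The same check at 16-bit fails immediately on the memory gate (140GB of weights against a 40GB card), which is the whole motivation for quantizing in the first place.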