
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains small adapter matrices alongside a frozen base model instead of updating all model weights. A typical LLM has billions of parameters, making full fine-tuning prohibitively expensive and requiring enormous GPU memory. LoRA inserts small rank-decomposed matrices (often just 0.1-1% of the original parameter count) into the model's attention layers. Only these adapter weights are trained, while the base model remains unchanged. At inference time, the adapter weights are merged into the base model at negligible cost. LoRA has become the standard approach for model customization because it reduces fine-tuning costs by 10-100× while achieving quality comparable to full fine-tuning.
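The "0.1–1% of the original parameter count" figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a hypothetical 7B-parameter model with hidden size 4096, 32 transformer layers, and rank-8 adapters on the four attention projections — all illustrative numbers, none taken from a specific model:

```python
# Back-of-envelope LoRA parameter count (all figures are illustrative assumptions)
hidden = 4096          # hidden size of a hypothetical 7B-class model
layers = 32            # transformer layers
rank = 8               # LoRA rank r
projections = 4        # q, k, v, o attention projections per layer

# Each adapted projection gets two matrices: A (hidden x r) and B (r x hidden)
params_per_adapter = 2 * hidden * rank
lora_params = layers * projections * params_per_adapter

base_params = 7_000_000_000
print(f"{lora_params:,} adapter params "
      f"({100 * lora_params / base_params:.2f}% of base)")
# -> 8,388,608 adapter params (0.12% of base)
```

Even adapting every attention projection at rank 8, the adapters amount to roughly a tenth of a percent of the base model, which is why they fit comfortably in tens of megabytes on disk.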
Why it matters
LoRA democratized model customization. Before LoRA, fine-tuning a 70B-parameter model required multiple high-end GPUs and tens of thousands of dollars in compute. With LoRA, the same model can be fine-tuned on a single GPU in hours for under €100. This shifted fine-tuning from a capability reserved for well-funded AI labs to something any development team can do. LoRA also enables a powerful operational model: one base model with multiple LoRA adapters for different tasks or clients — legal analysis, medical Q&A, code review — each trained independently and swapped at serving time. This multiplied customization capability comes without multiplied infrastructure costs, since only the small adapter weights (typically 10-100MB) need to be stored and loaded per variant.
How it works
LoRA works by decomposing weight updates into low-rank matrices. Instead of updating a full weight matrix W (dimensions d×d, potentially millions of parameters), LoRA trains two small matrices A (d×r) and B (r×d), where r (the rank) is much smaller than d — typically 4, 8, or 16. The effective weight update is the product AB, which has the same dimensions as W but is parameterized by far fewer values (2dr instead of d²). During training, only A and B are updated while W stays frozen; in the original formulation the update is also scaled by a factor α/r, where α is a hyperparameter controlling adapter strength. At inference time, the (scaled) update AB is simply added to W, producing the final weights without any additional inference cost. The rank r controls the trade-off between adapter capacity and efficiency — higher ranks allow more complex adaptations but use more memory. QLoRA extends this further by quantizing the frozen base model to 4-bit precision during training, reducing memory requirements to the point where a 65B-parameter model can be fine-tuned on a single 48GB GPU.
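A minimal NumPy sketch of these mechanics, using the notation above (W is d×d, A is d×r, B is r×d). The class name, the zero-initialization of B (so AB starts at zero and the adapted model initially matches the base), and the α/r scaling follow the common LoRA setup but are assumptions of this sketch, not details from the text:

```python
import numpy as np

class LoRALinear:
    """Frozen d x d weight W plus a trainable low-rank update A @ B."""

    def __init__(self, W, rank=8, alpha=16, seed=0):
        d = W.shape[0]
        rng = np.random.default_rng(seed)
        self.W = W                                # frozen base weight (d x d)
        self.A = rng.normal(0, 0.01, (d, rank))   # trainable, small random init
        self.B = np.zeros((rank, d))              # trainable, zero init: AB = 0 at start
        self.scale = alpha / rank                 # alpha/r scaling from the LoRA paper

    def forward(self, x):
        # Base path plus adapter path; the adapter adds only O(d*r) work
        return x @ self.W + self.scale * (x @ self.A) @ self.B

    def merge(self):
        # Fold the adapter into W once for inference: no extra serving cost
        return self.W + self.scale * self.A @ self.B
```

Training would update only `A` and `B` (e.g. by gradient descent) while `W` never changes; `merge()` shows why the deployed model runs at exactly the base model's speed.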
Example
A consulting firm serves five industry verticals (healthcare, finance, legal, manufacturing, retail) and wants a specialized AI writing assistant for each. Full fine-tuning of their chosen 70B model would require five separate copies — 700GB of model weights and five expensive training runs. With LoRA, they train five adapters (rank 16, approximately 80MB each) on industry-specific writing samples. Total additional storage: 400MB. Each adapter trains in 4 hours on a single A100 GPU. At serving time, the base model stays loaded in memory while adapters are swapped per request based on the client's industry — no model reloading required. The firm delivers five specialized writing assistants for the infrastructure cost of one, and when the base model is updated to a new version, they simply retrain the lightweight adapters rather than repeating five full fine-tuning runs.
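The swap-per-request serving pattern can be sketched as follows. The dimensions are toy-sized, and the `serve` function, adapter shapes, and registry layout are illustrative stand-ins, not the firm's actual stack:

```python
import numpy as np

d, rank = 64, 16                      # toy dimensions, not a real 70B model
rng = np.random.default_rng(0)
base_W = rng.normal(size=(d, d))      # loaded once, stays resident in memory

# One small (A, B) adapter pair per vertical, each trained independently
verticals = ["healthcare", "finance", "legal", "manufacturing", "retail"]
adapters = {
    v: (rng.normal(0, 0.01, size=(d, rank)),
        rng.normal(0, 0.01, size=(rank, d)))
    for v in verticals
}

def serve(x, vertical):
    # Pick the tiny adapter for this request; no base-model reload needed
    A, B = adapters[vertical]
    return x @ base_W + (x @ A) @ B

request = rng.normal(size=(1, d))
out = serve(request, "legal")         # routed by the client's industry
```

The key property is that `base_W` (the expensive part) is shared across all five verticals, while switching behavior per request is just a dictionary lookup over megabyte-scale adapter pairs.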