BVDNET
Models & Architecture
Intermediate

What Is LoRA (Low-Rank Adaptation)?

An efficient fine-tuning method that trains only small adapter layers instead of the full model

Also known as:
Low-Rank Adaptation
LoRA Fine-tuning
QLoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains small adapter matrices alongside a frozen base model instead of updating all model weights. A typical LLM has billions of parameters, making full fine-tuning prohibitively expensive and requiring enormous GPU memory. LoRA inserts small rank-decomposed matrices (often just 0.1-1% of the original parameter count) into the model's attention layers. Only these adapter weights are trained, while the base model remains unchanged. At inference time, the adapter weights are merged into the base model at negligible cost. LoRA has become the standard approach for model customization because it reduces fine-tuning costs by 10-100× while achieving quality comparable to full fine-tuning.
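The "0.1-1% of the original parameter count" claim is easy to verify with a back-of-the-envelope calculation. A minimal sketch for one weight matrix, using illustrative dimensions (d = 4096, rank r = 8) that are not tied to any specific model:

```python
# Parameter count: full fine-tuning vs. LoRA, for one d x d weight matrix.
# Dimensions are illustrative, not from any particular model.
d, r = 4096, 8

full_params = d * d            # updating the full weight matrix W
lora_params = d * r + r * d    # the two adapter matrices A (d x r) and B (r x d)

ratio = lora_params / full_params
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {ratio:.2%}")
# -> full: 16,777,216  lora: 65,536  ratio: 0.39%
```

At rank 8 the adapter is roughly 0.4% of the matrix it adapts, squarely in the 0.1-1% range the definition mentions; doubling the rank doubles the adapter size.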

Why it matters

LoRA democratized model customization. Before LoRA, fine-tuning a 70B-parameter model required multiple high-end GPUs and tens of thousands of dollars in compute. With LoRA, the same model can be fine-tuned on a single GPU in hours for under €100. This shifted fine-tuning from a capability reserved for well-funded AI labs to something any development team can do. LoRA also enables a powerful operational model: one base model with multiple LoRA adapters for different tasks or clients — legal analysis, medical Q&A, code review — each trained independently and swapped at serving time. This multiplied customization capability comes without multiplied infrastructure costs, since only the small adapter weights (typically 10-100MB) need to be stored and loaded per variant.
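The one-base-model-many-adapters serving pattern can be sketched in a few lines. This is a toy illustration with made-up dimensions and random weights, not a real serving stack; the client names mirror the examples above:

```python
import numpy as np

d, r = 64, 4                           # toy hidden size and LoRA rank
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d))            # one shared base weight, loaded once

# Hypothetical per-task adapters; each is just a small (A, B) pair.
adapters = {
    "legal":   (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "medical": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def serve(x, task):
    # Per request, only the tiny A and B change; the base model stays in memory.
    A, B = adapters[task]
    return x @ W + (x @ A) @ B

x = rng.normal(size=(1, d))
legal_out = serve(x, "legal")
medical_out = serve(x, "medical")
```

Because the adapters are small relative to W, holding many of them in memory and choosing one per request costs almost nothing compared to loading separate full models.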

How it works

LoRA works by decomposing weight updates into low-rank matrices. Instead of updating a full weight matrix W (dimensions d×d, potentially millions of parameters), LoRA trains two small matrices A (d×r) and B (r×d), where r (the rank) is much smaller than d — typically 4, 8, or 16. The effective weight update is the product AB, which has the same dimensions as W but is parameterized by far fewer values. During training, only A and B are updated while W stays frozen. At inference time, the update AB is simply added to W, producing the final weights without any additional inference cost. The rank r controls the trade-off between adapter capacity and efficiency — higher ranks allow more complex adaptations but use more memory. QLoRA extends this further by quantizing the base model to 4-bit precision during training, reducing memory requirements to the point where a 65B-parameter model can be fine-tuned on a single 48GB GPU.
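The mechanics above fit in a few lines of numpy. This is a minimal sketch with toy dimensions, following the convention used here (A is d×r, B is r×d, update = AB) and the usual initialization where one adapter matrix starts at zero so the adapted model initially behaves exactly like the base model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                          # hidden size and LoRA rank (toy values)

W = rng.normal(size=(d, d))           # frozen pretrained weight: never updated
A = rng.normal(size=(d, r)) * 0.01    # trainable, small random init
B = np.zeros((r, d))                  # trainable, zero init so AB = 0 at start

def forward(x):
    # During training: frozen base path plus the low-rank update path.
    # (x @ A) @ B avoids ever materializing the d x d product for training.
    return x @ W + (x @ A) @ B

def merged_weight():
    # At inference: fold the update into W once, so there is no per-token cost.
    return W + A @ B

x = rng.normal(size=(1, d))
assert np.allclose(forward(x), x @ merged_weight())
```

Before any training step, B is all zeros, so `forward(x)` equals `x @ W` — the adapter starts as a no-op and only gradually learns a task-specific correction.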

Example

A consulting firm serves five industry verticals (healthcare, finance, legal, manufacturing, retail) and wants a specialized AI writing assistant for each. Full fine-tuning of their chosen 70B model would require five separate copies — 700GB of model weights and five expensive training runs. With LoRA, they train five adapters (rank 16, approximately 80MB each) on industry-specific writing samples. Total additional storage: 400MB. Each adapter trains in 4 hours on a single A100 GPU. At serving time, the base model stays loaded in memory while adapters are swapped per request based on the client's industry — no model reloading required. The firm delivers five specialized writing assistants for the infrastructure cost of one, and when the base model is updated to a new version, they simply retrain the lightweight adapters rather than repeating five full fine-tuning runs.
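The "approximately 80MB per adapter" figure can be sanity-checked with rough arithmetic. The dimensions below are assumptions for illustration (70B-class hidden size of 8192, 80 layers, rank-16 adapters on two projection matrices per layer, fp16 storage), not the firm's actual configuration:

```python
# Back-of-the-envelope adapter size for a rank-16 LoRA on a 70B-class model.
# All dimensions are illustrative assumptions.
d, r, layers = 8192, 16, 80
targets_per_layer = 2      # e.g. adapters on the query and value projections only
bytes_per_param = 2        # fp16

params = layers * targets_per_layer * (d * r + r * d)
size_mb = params * bytes_per_param / 1e6
print(f"{params:,} params  ~{size_mb:.0f} MB")
# -> 41,943,040 params  ~84 MB
```

Under these assumptions a rank-16 adapter lands in the tens of megabytes, consistent with the ~80MB figure above — versus roughly 140GB for a full fp16 copy of a 70B model.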

Sources

  1. Hu et al. — LoRA: Low-Rank Adaptation of Large Language Models (arXiv)
  2. Hugging Face PEFT — LoRA Conceptual Guide (Web)
  3. Wikipedia


Related Concepts

Fine-Tuning
Training a pre-trained LLM further on domain-specific data to specialize its behavior
RLHF (Reinforcement Learning from Human Feedback)
A training technique that uses human preference ratings to align LLM behavior with human values
Quantization
Reducing model weight precision from 16/32-bit to 8/4-bit to shrink size and speed up inference
Model Distillation
Training a smaller 'student' model to replicate a larger 'teacher' model's capabilities at a fraction of the cost and latency

