Use Ctrl+P (or Cmd+P) to save as PDF. Back to paper
Fine-tuning rewrites the model. LoRA does not.
LoRA — Low-Rank Adaptation — freezes every pre-trained weight and learns only a small correction: two narrow matrices, A and B, whose product approximates the change that full fine-tuning would have made.
The math is direct. A weight matrix W has dimensions d × k. Full fine-tuning computes ΔW — a d × k update — and adds it in. LoRA instead trains B (d × r) and A (r × k), where r is the rank, typically 1 to 64. The update becomes BA. When r is 4 and d is 4096, you have replaced 16 million parameters with 32,768. That is the efficiency.
At training time: You compute gradients only for A and B. The base weights are frozen — no gradient flows through them, no optimizer state accumulates for them. Memory and compute drop dramatically. The original paper reports 10,000× fewer trainable parameters for GPT-3 and a 3× reduction in required hardware compared to full fine-tuning.
At inference time: You merge. Before serving, multiply B × A and add it to W. The result is a single weight matrix — identical in shape to the original. No extra computation, no latency overhead. The deployed model is structurally indistinguishable from a fully fine-tuned one.
For deployment: Because the base model is untouched, you can store dozens of task-specific LoRA adapters as small files — kilobytes to low megabytes — and swap them without reloading the full model. One base, many behaviors.
They treat LoRA as a cheaper path to the same destination as full fine-tuning. It is not.
LoRA learns in a constrained subspace. The rank parameter r is a ceiling on expressive capacity. Low rank = fast, small, and limited. High rank = closer to full fine-tuning, but you are spending what you saved.
The right question is not "what rank should I use?" It is "how much of the weight space actually needs to change for this task?" For narrow behavioral changes — tone, format, domain vocabulary — low rank is sufficient. For teaching genuinely new capabilities, full fine-tuning may be unavoidable.
LoRA works because large language models are over-parameterized. Their meaningful weight changes live in a low-dimensional subspace. LoRA exploits that. When the task fits the assumption, the efficiency is real. When it does not, you will find out by watching the adapter underperform.
If you self-host models — Ollama, llama.cpp, vLLM — LoRA gives you something fine-tuning never could: a versioned, swappable behavioral layer that costs almost nothing to store and nothing to serve. The base model is shared infrastructure. The adapter is configuration.
That is a fundamentally different architecture than "run one fine-tuned model per task."
1. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685
2. Microsoft. LoRA: Official Implementation. GitHub repository. https://github.com/microsoft/LoRA
3. Fireworks AI. Understanding LoRA Performance. Fireworks AI Docs. https://docs.fireworks.ai/guides/understanding_lora_performance