There are three levers for improving what an LLM produces: prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. In practice, engineers often reach for the wrong lever first: fine-tuning when RAG would do, or more prompt iteration when the gap is one that only RAG or fine-tuning can close.
This paper maps each technique to the problem it actually solves, with benchmarks, cost analysis, and a decision framework.
---
Before choosing a technique, classify your problem as a baseline, knowledge, or behavior problem, or as a combination of knowledge and behavior.
This classification drives everything:
| Problem type | Correct technique |
|---|---|
| Baseline: unknown capability | Prompt engineering |
| Knowledge: stale, private, or dynamic facts | RAG |
| Behavior: format, tone, domain reasoning | Fine-tuning |
| Both knowledge and behavior | RAG + fine-tuning |
The most common engineering mistake: fine-tuning a model to know facts. Fine-tuning does not reliably inject factual knowledge — it trains the model to behave a certain way. When you fine-tune on factual data, you often get confident hallucinations on the very facts you tried to teach.
---
What it solves: Establishing baseline capability. Defining task structure, persona, output format, and reasoning instructions.
Cost: Near-zero. No infrastructure. No GPU. Immediate iteration.
When to use:
When it fails:
Rule: Always start here. Prompt engineering is not a consolation prize; it is the foundation every other technique builds on. A poorly prompted model with RAG or fine-tuning often performs worse than a well-prompted base model.
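As a minimal sketch of what a structured prompt might look like, the snippet below assembles the four elements named above (task structure, persona, output format, reasoning instructions) into one string. The `PromptSpec` class, the `build_prompt` function, and the section labels are hypothetical names chosen for illustration, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the prompt elements discussed above."""
    persona: str
    task: str
    output_format: str
    reasoning: str = "Think through the problem step by step before answering."
    # Optional few-shot examples as (input, output) pairs.
    examples: list[tuple[str, str]] = field(default_factory=list)


def build_prompt(spec: PromptSpec, user_input: str) -> str:
    """Assemble a plain-text prompt; the section labels are illustrative."""
    parts = [
        f"Role: {spec.persona}",
        f"Task: {spec.task}",
        f"Output format: {spec.output_format}",
        f"Reasoning: {spec.reasoning}",
    ]
    for example_in, example_out in spec.examples:
        parts.append(f"Example input: {example_in}\nExample output: {example_out}")
    parts.append(f"Input: {user_input}")
    return "\n\n".join(parts)
```

Keeping the prompt in a structure like this makes it easier to change one element at a time and measure the effect, which is the point of starting with a baseline.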
---
What it solves: Knowledge problems. When the model needs to reason over documents, policies, product catalogs, codebases, or any information that was not in its training data or has since changed.
How it works: At inference time, relevant documents are retrieved from an external store (vector database, BM25 index, or hybrid) and injected into the context window alongside the user query. The model reads the retrieved content and generates a response grounded in it.
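The retrieve-then-inject flow described above can be sketched in a few lines. This toy version substitutes bag-of-words cosine similarity for a real embedding model or BM25 index, purely so it runs without dependencies; the function names and the prompt wording are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Inject the retrieved documents into the prompt alongside the query."""
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(retrieve(query, docs, k)))
    return (
        "Answer using only the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

In a production pipeline the `retrieve` step would query a vector database or hybrid index instead, but the shape of the final prompt stays the same.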
Cost: Moderate setup. Requires a document pipeline (chunking, embedding, indexing) and a retrieval layer. No GPU for training. Operational cost is primarily embedding inference and database queries.
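The chunking stage of that pipeline could look something like the sketch below, which splits text into fixed-size word windows with overlap. The default sizes are placeholder assumptions, not recommendations from the cited papers; real pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words
    with the previous window so facts near a boundary appear in both."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk would then be embedded and written to the index; chunk size and overlap are among the first parameters worth tuning when retrieval quality is poor.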
Benchmark results:
Lewis et al. (2020), the original RAG paper from Facebook AI, reported state-of-the-art results on three open-domain question answering benchmarks, outperforming parametric-only seq2seq models that answer from their weights alone.
The CRAG benchmark (2024) showed that LLM-only solutions achieve at most 34% accuracy on its questions, and that adding RAG in a straightforward way raises this only to 44%.
In "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs" (EMNLP 2024), Mistral 7B scored 0.481 on knowledge-intensive tasks at baseline. Adding RAG raised that to 0.875; fine-tuning alone raised it to 0.704. In that comparison, RAG clearly outperformed fine-tuning for knowledge injection.
A 2025 medical LLM study (PMC) comparing Llama and Mistral variants reported BLEU 0.163 for RAG versus 0.089 for fine-tuning alone, an 83% relative improvement. BLEU measures n-gram overlap with reference answers, so it is a proxy for factual recall rather than a direct measure of it.
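To put the figures above on a common footing, the short sketch below computes the relative gain for each comparison. The numbers are copied from the results quoted in this section; the `relative_gain` helper and the labels are illustrative, and the metrics differ across studies, so the percentages are not directly comparable with one another.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# (baseline, improved) pairs taken from the results quoted above.
results = {
    "CRAG accuracy, LLM-only vs. RAG": (0.34, 0.44),
    "Mistral 7B, baseline vs. RAG": (0.481, 0.875),
    "Mistral 7B, baseline vs. fine-tuning": (0.481, 0.704),
    "Medical BLEU, fine-tuning vs. RAG": (0.089, 0.163),
}

for name, (baseline, improved) in results.items():
    print(f"{name}: {relative_gain(baseline, improved):+.0%}")
```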
When to use:
When it fails:
Retrieval quality matters more than model quality. A retrieval failure cannot be corrected downstream. The model will hallucinate confidently using whatever it was given. Invest in the retrieval layer first.
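One way to evaluate the retrieval layer on its own, before any generation happens, is recall@k over a small labeled set of queries. The sketch below assumes each query has a known set of relevant document IDs; the function names and data layout are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)
```

If recall@k is low, the generator never sees the right evidence, so improving the model or the prompt is unlikely to help until retrieval is fixed.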
---
What it solves: Behavior problems. When you need the model to reason differently, respond in a specific format consistently, adopt domain-specific terminology, or match a communication style — and prompt engineering cannot achieve this reliably at scale.
How it works: The model's weights are adjusted by continuing training on a curated dataset of (input, output) pairs representing the desired behavior. Modern fine-tuning almost always uses parameter-efficient methods (LoRA, QLoRA) rather than full weight updates.
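The low-rank idea behind LoRA can be illustrated with plain Python lists. Following the formulation in Hu et al. (2021), the frozen weight matrix W is left untouched and the layer output becomes W x + (alpha / r) · B (A x), where A is r × d_in and B is d_out × r, with B initialized to zero so training starts from the base model's behavior. This is a toy sketch for intuition only; the helper names are made up, and real training would use a tensor library with autograd.

```python
import random

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_init(d_out: int, d_in: int, r: int, seed: int = 0) -> tuple[Matrix, Matrix]:
    """A gets small random values; B starts at zero, so B @ A is zero at first."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Matrix, alpha: float, r: int) -> Matrix:
    """Compute W x + (alpha / r) * B (A x) for a column vector x (d_in x 1).
    Only A and B would receive gradient updates; W stays frozen."""
    base = matmul(W, x)
    update = matmul(B, matmul(A, x))
    scale = alpha / r
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]
```

Because only A and B are trained, the saved adapter holds r × (d_in + d_out) numbers per adapted matrix instead of d_in × d_out.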
Cost:
Full fine-tuning of a 70B model requires approximately 1 TB of GPU memory across optimizer states, gradients, weights, and activations (Holyk, 2024). This is impractical for most teams.
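A rough back-of-the-envelope check of that figure is sketched below. It assumes the common mixed-precision Adam accounting of about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights and the two Adam moments) and ignores activations, so it is a lower bound under those assumptions rather than a measurement.

```python
import math


def full_finetune_bytes(n_params: int, weight_bytes: int = 2,
                        grad_bytes: int = 2, optimizer_bytes: int = 12) -> int:
    """Memory for weights, gradients, and optimizer state (activations excluded).
    The per-parameter byte counts are assumptions for mixed-precision Adam."""
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes)


total = full_finetune_bytes(70_000_000_000)
gpus_80gb = math.ceil(total / 80_000_000_000)  # minimum count of 80 GB GPUs

print(f"{total / 1e12:.2f} TB before activations, at least {gpus_80gb} x 80 GB GPUs")
```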
LoRA (Low-Rank Adaptation) freezes the base weights and trains small low-rank adapter matrices, which typically amount to about 1% of the model's parameter count or less. Checkpoints are typically 8–50 MB versus gigabytes for full models. A 7B model can be LoRA fine-tuned on a single A100 80 GB GPU in hours. This is the practical standard for production fine-tuning in 2025–2026.
From Hu et al. (2021), the LoRA paper: LoRA matches or exceeds full fine-tuning on many benchmarks while reducing the number of trainable parameters by 10,000x on GPT-3.
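The order of magnitude of that reduction can be reproduced with simple arithmetic. The sketch below assumes GPT-3 175B's published shape (96 layers, hidden size 12,288) and one configuration discussed in the LoRA paper: rank 4, adapting only the query and value projections. Other ranks and target matrices would give different counts.

```python
def lora_trainable_params(n_layers: int, d_model: int, r: int,
                          matrices_per_layer: int) -> int:
    """Each adapted d_model x d_model matrix gains A (r x d_model) and
    B (d_model x r), i.e. r * 2 * d_model trainable parameters."""
    return n_layers * matrices_per_layer * r * 2 * d_model


# Assumed configuration: GPT-3 175B, rank 4, query and value projections only.
trainable = lora_trainable_params(n_layers=96, d_model=12_288, r=4, matrices_per_layer=2)
reduction = 175_000_000_000 / trainable
checkpoint_mb = trainable * 2 / 1e6  # fp16 storage, 2 bytes per parameter

print(f"{trainable:,} trainable parameters, about {reduction:,.0f}x fewer, "
      f"roughly {checkpoint_mb:.0f} MB adapter checkpoint")
```

Under these assumptions the adapter lands in the tens of megabytes, consistent with the checkpoint sizes quoted above.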
When to use:
When it fails:
Anti-pattern: Fine-tuning for knowledge is the most common expensive mistake in LLM engineering. If you train a model on your company's FAQ, it will answer FAQ questions — until the FAQ changes. Then it hallucinates the old answers confidently.
---
Use this in order. Do not skip steps.
Step 1: Measure baseline. Run the task with a carefully written prompt. No RAG, no fine-tuning. Measure quality.
Step 2: Diagnose failure. Is the model failing because it doesn't know something? Or because it's behaving wrong?
Step 3: If knowledge failure → RAG. Build a retrieval pipeline. Evaluate chunk quality. Measure retrieval recall before measuring generation quality.
Step 4: If behavior failure → Fine-tuning. Build a dataset. Use LoRA. Evaluate against the prompted baseline, not against a vague intuition.
Step 5: If both → RAG + fine-tune. Fine-tune for behavior. RAG for knowledge. Keep them separate.
Step 6: Iterate. None of these are one-shot solutions. Prompt engineering is an ongoing practice. RAG pipelines require chunking strategy tuning. Fine-tuned models require dataset curation.
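The steps above can be condensed into a small routing function. This is only an illustrative sketch: the `Failure` enum, the `next_step` function, and the idea of comparing a single baseline score against a target are simplifying assumptions, and the diagnosis in Step 2 still has to be done by a person reading failure cases.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    KNOWLEDGE = "knowledge"   # the model does not know something
    BEHAVIOR = "behavior"     # the model knows enough but behaves wrong
    BOTH = "both"


def next_step(baseline_score: float, target_score: float,
              failure: Optional[Failure] = None) -> str:
    """Map a measured prompt-only baseline and a diagnosed failure to a technique."""
    if baseline_score >= target_score:
        return "prompt engineering is sufficient"
    if failure is None:
        return "diagnose the failure first (Step 2)"
    return {
        Failure.KNOWLEDGE: "RAG",
        Failure.BEHAVIOR: "fine-tuning (LoRA)",
        Failure.BOTH: "RAG + fine-tuning",
    }[failure]
```

For example, `next_step(0.62, 0.85, Failure.KNOWLEDGE)` routes to RAG, while a baseline that already meets the target never leaves Step 1.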
---
The 2024 consensus from practitioners: production systems rarely use one technique alone. A customer support system might fine-tune for tone and response structure, use RAG to retrieve current policy documents, and use a carefully engineered system prompt to orchestrate both.
This is not complexity for its own sake. Each component solves a distinct problem. The engineering discipline is keeping the boundaries clean:
When these boundaries blur — when teams fine-tune because RAG feels complex, or RAG-ify what should be baked into behavior — quality drops and costs rise.
---
A fourth option has emerged with long-context models: stuffing everything into the context window. Models like Gemini 1.5 Pro (1M tokens) and Claude (200K tokens) can ingest large corpora at inference time without retrieval.
For small, stable corpora, this is viable and simpler than building a retrieval pipeline. For large or frequently changing corpora, the token cost of resending the corpus on every request becomes prohibitive, and retrieval remains the correct architecture.
From the EMNLP 2024 paper on RAG vs. long-context LLMs: long-context LLMs outperform RAG when sufficiently resourced, but RAG maintains a significant cost advantage at scale.
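The cost side of that trade-off comes down to how many input tokens each request carries. The sketch below compares the two under made-up numbers: the price per million tokens, the corpus size, and the chunk sizes are all placeholder assumptions, not quotes from any provider, and embedding and database costs for RAG are left out.

```python
def input_cost_per_request(context_tokens: int, query_tokens: int,
                           usd_per_million_tokens: float) -> float:
    """Input-token cost of one request at a flat per-token price."""
    return (context_tokens + query_tokens) * usd_per_million_tokens / 1_000_000


PRICE = 1.0            # hypothetical USD per million input tokens
CORPUS_TOKENS = 500_000  # hypothetical corpus sent whole in the long-context setup
TOP_K, CHUNK_TOKENS = 5, 400  # hypothetical retrieval settings
QUERY_TOKENS = 200

long_context = input_cost_per_request(CORPUS_TOKENS, QUERY_TOKENS, PRICE)
rag = input_cost_per_request(TOP_K * CHUNK_TOKENS, QUERY_TOKENS, PRICE)

print(f"long context: ${long_context:.4f} per request, RAG: ${rag:.4f} per request")
```

The ratio between the two scales with corpus size, which is why long context stays attractive only while the corpus is small.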
---
| Technique | Problem solved | Cost | When to start |
|---|---|---|---|
| Prompt engineering | Baseline behavior | Near-zero | Always, first |
| RAG | Knowledge (dynamic/private) | Medium setup | When knowledge is the gap |
| Fine-tuning (LoRA) | Behavior (format/style/domain) | Medium compute | When behavior is the gap |
| Long context | Small stable corpora | High inference cost | When simplicity beats cost |
The decision is not about which technique is most sophisticated. It is about which problem you actually have.
Start with the prompt. Measure. Then choose.
1. Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. https://arxiv.org/abs/2005.11401
2. Hu, E. J., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. https://arxiv.org/abs/2106.09685
3. Ovadia, O., Brief, M., Mishaeli, M., & Elisha, O. (2024). Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. EMNLP 2024. https://aclanthology.org/2024.emnlp-main.15.pdf
4. Yang, X., Sun, K., Xin, H., et al. (2024). CRAG – Comprehensive RAG Benchmark. arXiv:2406.04744. https://arxiv.org/pdf/2406.04744
5. Li, Z., Li, C., Zhang, M., Mei, Q., & Bendersky, M. (2024). Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. EMNLP 2024 Industry Track. https://aclanthology.org/2024.emnlp-industry.66.pdf
6. Balaguer, A., Benara, V., Cunha, R., et al. (2024). RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. arXiv:2401.08406. https://arxiv.org/abs/2401.08406
7. Lialin, V., Deshpande, V., & Rumshisky, A. (2023). Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. arXiv:2303.15647. https://arxiv.org/pdf/2303.15647
8. Holyk, A. (2024). Fine-Tuning & Parameter-Efficient Adaptation. AI Compendium. https://www.alexholyk.com/6-llms/fine-tuning-parameter-efficient.html
9. IBM Research. (2025). RAG vs Fine-Tuning vs Prompt Engineering. IBM Think. https://www.ibm.com/think/topics/rag-vs-fine-tuning-vs-prompt-engineering
10. Medical LLMs study. (2025). Fine-Tuning vs. Retrieval-Augmented Generation for Medical LLMs. PMC/NCBI. https://pmc.ncbi.nlm.nih.gov/articles/PMC12292519/