whitepaper

Fine-Tuning, RAG, or Prompting: An Engineering Decision

devinfo.dev — June 1, 2026

devinfo.dev:2026.0018

There are three levers for improving what an LLM produces: prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. In practice, engineers reach for the wrong lever first — often fine-tuning when RAG would do, or prompting when neither is sufficient.

This paper maps each technique to the problem it actually solves, with benchmarks, cost analysis, and a decision framework.

---

The Core Distinction

Before choosing a technique, classify your problem:

  • Knowledge problem: The model doesn't know something — a document, a policy, a fact that changed after training.
  • Behavior problem: The model knows, but doesn't respond in the right format, style, tone, or domain-specific reasoning pattern.
  • Baseline problem: You haven't yet established what the model can do without modification.

This classification drives everything:

| Problem type | Correct technique |
|---|---|
| Baseline: unknown capability | Prompt engineering |
| Knowledge: stale, private, or dynamic facts | RAG |
| Behavior: format, tone, domain reasoning | Fine-tuning |
| Both knowledge and behavior | RAG + fine-tuning |

The most common engineering mistake: fine-tuning a model to know facts. Fine-tuning does not reliably inject factual knowledge — it trains the model to behave a certain way. When you fine-tune on factual data, you often get confident hallucinations on the very facts you tried to teach.

---

Prompt Engineering

What it solves: Establishing baseline capability. Defining task structure, persona, output format, and reasoning instructions.
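As a minimal sketch of the four elements above, a prompt can be assembled from named sections so that each one can be varied and measured on its own. The function name `build_prompt`, the section labels, and the sample values are illustrative assumptions, not a prescribed format.

```python
def build_prompt(persona: str, task: str, output_format: str,
                 reasoning: str, user_input: str) -> str:
    """Join the prompt elements into labeled sections (labels are an assumption)."""
    sections = [
        ("Role", persona),
        ("Task", task),
        ("Output format", output_format),
        ("Reasoning", reasoning),
        ("Input", user_input),
    ]
    return "\n\n".join(f"{label}:\n{body.strip()}" for label, body in sections)


# Hypothetical ticket-classification task, used only for illustration.
example_prompt = build_prompt(
    persona="You are a support agent for a software product.",
    task="Classify the ticket as 'bug', 'billing', or 'other'.",
    output_format='Return JSON: {"label": "<one of the three labels>"}',
    reasoning="Decide from the ticket text only; do not guess missing details.",
    user_input="I was charged twice this month.",
)
print(example_prompt)
```

Keeping the sections separate makes it easier to change one element at a time and attribute any quality change to it.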

Cost: Near-zero. No infrastructure. No GPU. Immediate iteration.

When to use:

  • You haven't measured what the base model can do.
  • You need fast iteration with no deployment overhead.
  • The task is well within the model's training distribution.

When it fails:

  • The information required is not in the model's training data.
  • The required output format or domain behavior is too far from the model's defaults to correct via instruction.

Rule: Always start here. Prompt engineering is not a consolation prize — it is the foundation every other technique builds on. A poorly prompted model with RAG or fine-tuning is worse than a well-prompted base model.
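"Start here" only helps if the baseline is measured. The sketch below scores a prompted model against a small labeled set using exact match. The `stub_model` function is a hypothetical stand-in for a real model call, and exact match is just one possible metric, chosen here for simplicity.

```python
from typing import Callable


def exact_match_accuracy(model: Callable[[str], str],
                         cases: list[tuple[str, str]]) -> float:
    """Share of cases where the model output equals the expected answer,
    ignoring case and surrounding whitespace."""
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in cases
    )
    return hits / len(cases)


def stub_model(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to the model under test.
    canned = {"Capital of France?": "Paris", "2 + 2 = ?": "5"}
    return canned.get(prompt, "")


cases = [("Capital of France?", "paris"), ("2 + 2 = ?", "4")]
baseline = exact_match_accuracy(stub_model, cases)
print(f"baseline accuracy: {baseline:.2f}")
```

The same function can later score a RAG or fine-tuned variant, so every technique is compared against the same prompted baseline.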

---

Retrieval-Augmented Generation (RAG)

What it solves: Knowledge problems. When the model needs to reason over documents, policies, product catalogs, codebases, or any information that was not in its training data or has since changed.

How it works: At inference time, relevant documents are retrieved from an external store (vector database, BM25 index, or hybrid) and injected into the context window alongside the user query. The model reads the retrieved content and generates a response grounded in it.
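The retrieve-then-inject flow can be sketched in a few lines. This toy version scores documents with bag-of-words cosine similarity as a stand-in for a real embedding model or BM25 index. All function names, the prompt wording, and the sample documents are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (toy lexical scoring)."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))),
                    reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Inject the retrieved documents into the context ahead of the question."""
    sources = "\n".join(f"[{i + 1}] {d}"
                        for i, d in enumerate(retrieve(query, docs, k)))
    return ("Answer using only the sources below and cite source numbers.\n\n"
            f"Sources:\n{sources}\n\nQuestion: {query}")


docs = [
    "Refunds are accepted within 30 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Support is available on weekdays from 9 to 17.",
]
query = "Are refunds accepted after 30 days?"
top = retrieve(query, docs, k=1)
print(build_rag_prompt(query, docs, k=1))
```

Numbering the sources in the prompt is one simple way to support the auditability requirement, since the model can be asked to cite them.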

Cost: Moderate setup. Requires a document pipeline (chunking, embedding, indexing) and a retrieval layer. No GPU for training. Operational cost is primarily embedding inference and database queries.

Benchmark results:

Lewis et al. (2020), the original RAG paper from Facebook AI, reported state-of-the-art results on three open-domain question answering benchmarks, ahead of parametric-only seq2seq baselines.

The CRAG benchmark (2024) showed LLM-only solutions achieve 34% accuracy on complex questions; straightforward RAG solutions reach 44%.

In "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs" (EMNLP 2024), Mistral 7B scored 0.481 on knowledge-intensive tasks at baseline. Adding RAG moved that to 0.875. Fine-tuning alone moved it to 0.704. RAG clearly outperforms fine-tuning for knowledge injection.

A 2025 medical LLM study (PMC) comparing Llama and Mistral variants showed RAG achieved BLEU 0.163 versus 0.089 for fine-tuning alone, an 83% relative improvement. BLEU measures n-gram overlap with reference answers, so it is only a proxy for factual recall.
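The figures quoted above mix absolute scores and relative gains, so it helps to put them on one footing. The short calculation below uses only the numbers cited in this section; the helper name `relative_gain` is an assumption.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


gains = {
    "CRAG: RAG vs LLM-only (0.44 vs 0.34)": relative_gain(0.44, 0.34),
    "Mistral 7B: RAG vs baseline (0.875 vs 0.481)": relative_gain(0.875, 0.481),
    "Mistral 7B: fine-tuning vs baseline (0.704 vs 0.481)": relative_gain(0.704, 0.481),
    "Medical study: RAG vs fine-tuning, BLEU (0.163 vs 0.089)": relative_gain(0.163, 0.089),
}
for label, gain in gains.items():
    print(f"{label}: {gain * 100:.0f}%")
```

On these numbers RAG roughly doubles the fine-tuning gain over the Mistral 7B baseline (about 82% versus 46%), and the medical comparison reproduces the 83% figure.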

When to use:

  • Data changes frequently (daily, weekly).
  • Data is private and must not be included in training sets.
  • You need auditability — the ability to cite which source the model used.
  • The corpus is large and heterogeneous.
  • Users ask questions whose answers live in documents.

When it fails:

  • Retrieved chunks are too coarse or too fine (chunking strategy matters significantly).
  • Retrieval quality is low: the model gets the wrong context and confidently uses it. This is why retrieval, not generation, is the weakest link in a RAG system.
  • Latency is a hard constraint — every RAG call adds a retrieval round trip.
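Chunk granularity, the first failure mode above, is usually controlled by two parameters: chunk size and overlap. The sketch below splits on whitespace-delimited words, which is a simplifying assumption; production pipelines often split on tokens, sentences, or document structure instead. The function name and defaults are illustrative.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, size=4, overlap=1))
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of a larger index.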

Retrieval quality matters more than model quality. A retrieval failure cannot be corrected downstream. The model will hallucinate confidently using whatever it was given. Invest in the retrieval layer first.
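One way to invest in the retrieval layer first is to measure it in isolation. Recall@k, sketched below, asks what fraction of the documents labeled relevant for a query appear in the top k results. The document IDs and relevance labels are made-up placeholders; a real evaluation needs a labeled query set.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant document IDs that appear in the top k results."""
    if not relevant:
        raise ValueError("each query needs at least one labeled relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved IDs, relevant IDs) pairs, one per query."""
    return sum(recall_at_k(retrieved, relevant, k)
               for retrieved, relevant in runs) / len(runs)


# Placeholder results for two queries: ranked retrieved IDs and labeled relevant IDs.
runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),
    (["d9", "d2", "d4"], {"d5"}),
]
print(mean_recall_at_k(runs, k=2), mean_recall_at_k(runs, k=3))
```

If recall@k is low, changing the generator will not help, because the right context never reaches it.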

---

Fine-Tuning

What it solves: Behavior problems. When you need the model to reason differently, respond in a specific format consistently, adopt domain-specific terminology, or match a communication style — and prompt engineering cannot achieve this reliably at scale.

How it works: The model's weights are adjusted by continuing training on a curated dataset of (input, output) pairs representing the desired behavior. Modern fine-tuning almost always uses parameter-efficient methods (LoRA, QLoRA) rather than full weight updates.

Cost:

Full fine-tuning of a 70B model requires approximately 1 TB of GPU memory across optimizer states, gradients, weights, and activations (Holyk, 2024). This is impractical for most teams.
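The order of magnitude can be checked with a common rule of thumb for mixed-precision Adam training: about 16 bytes per parameter before activations are counted. The byte breakdown below is an assumption of this sketch and is not taken from the cited source.

```python
# Assumed per-parameter footprint for mixed-precision training with Adam.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment (fp32)": 4,
    "Adam second moment (fp32)": 4,
}


def training_state_gb(n_params: int) -> float:
    """Memory for weights, gradients, and optimizer state in GB (10^9 bytes).
    Activations are excluded and add to this total."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


print(training_state_gb(70_000_000_000))  # 70B parameters
```

Under these assumptions a 70B model needs roughly 1,120 GB before activations, which is consistent with the approximate 1 TB figure above.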

LoRA (Low-Rank Adaptation) freezes the base weights and trains small low-rank matrices added alongside them, typically on the order of 1% of the parameter count or less. Checkpoints are typically 8–50 MB versus gigabytes for full models. A 7B model can be LoRA fine-tuned on a single A100 80 GB GPU in hours. This is the practical standard for production fine-tuning in 2025–2026.
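The small checkpoint sizes follow from the low-rank decomposition: a rank-r update to a d_out × d_in weight matrix adds r × (d_in + d_out) trainable parameters. The sketch below works through one configuration. The layer count, hidden size, rank, and choice of adapted matrices are hypothetical values picked to resemble a 7B-class model, not figures from this paper.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter:
    A is rank x d_in and B is d_out x rank."""
    return rank * (d_in + d_out)


# Hypothetical configuration: 32 layers, two adapted 4096 x 4096 projections
# per layer, rank 8, adapter weights stored in fp16 (2 bytes each).
hidden, layers, adapted_per_layer, rank = 4096, 32, 2, 8

per_matrix = lora_params(hidden, hidden, rank)
total = per_matrix * adapted_per_layer * layers
size_mb = total * 2 / 1e6

print(f"per matrix: {per_matrix:,} of {hidden * hidden:,} "
      f"({per_matrix / (hidden * hidden):.2%})")
print(f"adapter total: {total:,} parameters, about {size_mb:.1f} MB in fp16")
```

With these assumed values the adapter holds about 4.2 million parameters and takes about 8.4 MB, at the low end of the range quoted above. Higher ranks or more adapted matrices scale it up proportionally.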

From Hu et al. (2021), the LoRA paper: LoRA achieves task performance comparable to or better than full fine-tuning on many benchmarks while reducing trainable parameters by 10,000x on GPT-3.

When to use:

  • Output format must be consistent at high volume (structured JSON, domain-specific schemas).
  • Brand voice, communication style, or domain-specific reasoning patterns are non-negotiable.
  • You are deploying at high inference volume and want to use a smaller, faster model trained to behave like a larger one.
  • The task behavior is stable — your requirements are not changing frequently.
  • You have a clean dataset of at least 500–2,000 high-quality (input, output) examples.
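Because fine-tuning amplifies data problems, a cheap validation pass over the (input, output) examples is worth running before any training. The sketch below assumes a JSON Lines file with `input` and `output` string fields. That layout and the specific checks are assumptions; training frameworks each define their own expected format.

```python
import json


def validate_examples(lines: list[str]) -> tuple[list[dict], list[str]]:
    """Return (usable examples, problem descriptions) for JSON Lines records."""
    good, problems, seen_inputs = [], [], set()
    for n, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {n}: not valid JSON")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {n}: expected a JSON object")
            continue
        inp, out = row.get("input"), row.get("output")
        if not (isinstance(inp, str) and inp.strip()
                and isinstance(out, str) and out.strip()):
            problems.append(f"line {n}: missing or empty input/output")
            continue
        if inp in seen_inputs:
            problems.append(f"line {n}: duplicate input")
            continue
        seen_inputs.add(inp)
        good.append({"input": inp, "output": out})
    return good, problems


# Made-up records: one valid, one duplicate, one broken, one with empty output.
sample_lines = [
    '{"input": "Reset my password", "output": "{\\"intent\\": \\"account\\"}"}',
    '{"input": "Reset my password", "output": "{\\"intent\\": \\"account\\"}"}',
    '{"input": "Where is my order"',
    '{"input": "Cancel my plan", "output": ""}',
]
good, problems = validate_examples(sample_lines)
print(len(good), problems)
```

Counting the usable examples after validation gives a more honest read on whether the dataset meets the size threshold above.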

When it fails:

  • You are trying to teach the model facts. Fine-tuning produces behavior changes, not reliable knowledge injection. Use RAG.
  • Your dataset is small and noisy. Fine-tuning amplifies data quality problems.
  • Your requirements change frequently. A fine-tuned model is a snapshot; updating it requires retraining.
  • You skip prompt engineering first. A fine-tuned model without a good system prompt still underperforms.

Anti-pattern: Fine-tuning for knowledge is the most common expensive mistake in LLM engineering. If you train a model on your company's FAQ, it will answer FAQ questions until the FAQ changes. After that it keeps giving the old answers with full confidence, and correcting them requires retraining.

---

Decision Framework

Use this in order. Do not skip steps.

Step 1: Measure baseline. Run the task with a carefully written prompt. No RAG, no fine-tuning. Measure quality.

Step 2: Diagnose failure. Is the model failing because it doesn't know something? Or because it's behaving wrong?

Step 3: If knowledge failure → RAG. Build a retrieval pipeline. Evaluate chunk quality. Measure retrieval recall before measuring generation quality.

Step 4: If behavior failure → Fine-tuning. Build a dataset. Use LoRA. Evaluate against the prompted baseline, not against a vague intuition.

Step 5: If both → RAG + fine-tune. Fine-tune for behavior. RAG for knowledge. Keep them separate.

Step 6: Iterate. None of these is a one-shot solution. Prompt engineering is an ongoing practice. RAG pipelines require chunking strategy tuning. Fine-tuned models require dataset curation.
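The steps above reduce to a small decision function. The sketch assumes the diagnosis from Steps 1 and 2 has already been made and is passed in as three booleans. The function and parameter names are illustrative.

```python
def choose_technique(baseline_ok: bool, knowledge_gap: bool,
                     behavior_gap: bool) -> str:
    """Map the diagnosis from Steps 1 and 2 to the technique in Steps 3 to 5."""
    if baseline_ok:
        return "prompt engineering"  # the prompted baseline already meets the bar
    if knowledge_gap and behavior_gap:
        return "RAG + fine-tuning"
    if knowledge_gap:
        return "RAG"
    if behavior_gap:
        return "fine-tuning"
    # Failure without a diagnosis: keep iterating on the prompt and the analysis.
    return "prompt engineering"


print(choose_technique(baseline_ok=False, knowledge_gap=True, behavior_gap=False))
```

The final branch is a design choice of this sketch: when the failure is not yet diagnosed, it falls back to prompt iteration rather than guessing at a heavier technique.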

---

Hybrid Systems

The 2024 consensus from practitioners: production systems rarely use one technique alone. A customer support system might fine-tune for tone and response structure, use RAG to retrieve current policy documents, and use a carefully engineered system prompt to orchestrate both.

This is not complexity for its own sake. Each component solves a distinct problem. The engineering discipline is keeping the boundaries clean:

  • The fine-tuned model governs behavior.
  • The RAG layer governs knowledge.
  • The prompt governs task framing and orchestration.

When these boundaries blur — when teams fine-tune because RAG feels complex, or RAG-ify what should be baked into behavior — quality drops and costs rise.

---

Long Context as an Alternative

A fourth option has emerged in 2025: stuffing everything into a long context window. Models like Gemini 1.5 Pro (1M tokens) and Claude (200K tokens) can ingest large corpora at inference time without retrieval.

For small, stable corpora, this is viable and simpler than building a retrieval pipeline. For large, dynamic, or private datasets, the cost per query becomes prohibitive, because the whole corpus is billed as input tokens on every call, and retrieval remains the correct architecture.
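The cost gap is easy to estimate. The sketch below compares input-token spend when the whole corpus is sent on every query against sending only a few retrieved chunks. The price, corpus size, chunk size, and query volume are hypothetical, and the comparison ignores output tokens, prompt caching, and the embedding and indexing costs of RAG.

```python
def input_cost_usd(tokens_per_query: int, queries: int,
                   usd_per_million_tokens: float) -> float:
    """Total input-token cost for a batch of queries."""
    return tokens_per_query * queries * usd_per_million_tokens / 1_000_000


# Hypothetical workload and price.
corpus_tokens = 500_000   # entire corpus placed in the context window
rag_tokens = 5 * 400      # five retrieved chunks of 400 tokens each
queries = 10_000
price = 1.00              # assumed USD per million input tokens

long_context = input_cost_usd(corpus_tokens, queries, price)
rag = input_cost_usd(rag_tokens, queries, price)
print(f"long context: ${long_context:,.0f}, RAG: ${rag:,.0f}, "
      f"ratio: {long_context / rag:.0f}x")
```

Under these assumptions the long-context approach costs 250 times more in input tokens. The ratio depends only on corpus size relative to retrieved context, so it shrinks quickly for small corpora.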

From the EMNLP 2024 paper on RAG vs. long-context LLMs: long-context LLMs outperform RAG when sufficiently resourced, but RAG maintains a significant cost advantage at scale.

---

Summary

| Technique | Problem solved | Cost | When to start |
|---|---|---|---|
| Prompt engineering | Baseline behavior | Near-zero | Always, first |
| RAG | Knowledge (dynamic/private) | Medium setup | When knowledge is the gap |
| Fine-tuning (LoRA) | Behavior (format/style/domain) | Medium compute | When behavior is the gap |
| Long context | Small stable corpora | High inference cost | When simplicity beats cost |

The decision is not about which technique is most sophisticated. It is about which problem you actually have.

Start with the prompt. Measure. Then choose.

References

1. Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. https://arxiv.org/abs/2005.11401

2. Hu, E. J., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. https://arxiv.org/abs/2106.09685

3. Ovadia, O., Brief, M., Mishaeli, M., & Elisha, O. (2024). Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. EMNLP 2024. https://aclanthology.org/2024.emnlp-main.15.pdf

4. Yang, X., Sun, K., Xin, H., et al. (2024). CRAG – Comprehensive RAG Benchmark. arXiv:2406.04744. https://arxiv.org/pdf/2406.04744

5. Li, Z., Li, C., Zhang, M., et al. (2024). Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. EMNLP 2024 Industry Track. https://aclanthology.org/2024.emnlp-industry.66.pdf

6. Balaguer, A., Benara, V., Cunha, R. L. F., et al. (2024). RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. arXiv:2401.08406. https://arxiv.org/abs/2401.08406

7. Lialin, V., Deshpande, V., & Rumshisky, A. (2023). Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. arXiv:2303.15647. https://arxiv.org/pdf/2303.15647

8. Holyk, A. (2024). Fine-Tuning & Parameter-Efficient Adaptation. AI Compendium. https://www.alexholyk.com/6-llms/fine-tuning-parameter-efficient.html

9. IBM Research. (2025). RAG vs Fine-Tuning vs Prompt Engineering. IBM Think. https://www.ibm.com/think/topics/rag-vs-fine-tuning-vs-prompt-engineering

10. Medical LLMs study. (2025). Fine-Tuning vs. Retrieval-Augmented Generation for Medical LLMs. PMC/NCBI. https://pmc.ncbi.nlm.nih.gov/articles/PMC12292519/

Cite as

devinfo.dev. (2026). "Fine-Tuning, RAG, or Prompting: An Engineering Decision." devinfo.dev:2026.0018. https://devinfo.dev/d/2026.0018

Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.