Engineering

RAG, fine-tuning, or just better prompts: a 2026 decision tree

Most teams reach for RAG when prompting would do, and fine-tuning when RAG would do. The honest decision tree, with the cases where each actually wins.

10 May 202610 min readKrypto Forge

The three letters that get reached for too fast are RAG, FT (fine-tune), and "we need a bigger model". In 2026, with frontier models as capable as they are, the right answer is more often "better prompts and the right context window" than anyone in the room wants to admit.

This is the decision tree we use, in the order we use it.

Step zero: have you actually pushed prompting?

Before you reach for retrieval, fine-tuning, or a different model, ask whether you've actually tried prompting hard.

Hard prompting in 2026 means:

  • A detailed system prompt with role, scope, and constraints.
  • 3-5 worked examples (few-shot) covering the variants you actually see.
  • Structured output requirements (JSON schema, response format).
  • Explicit "if unsure, do X" guardrails.
  • Versioned in source control, with at least one eval run before you ship.

A surprising amount of "this needs RAG" turns out to be "this needs a better prompt and a one-shot example". We've seen teams burn weeks on a retrieval pipeline because nobody tried adding an example to the prompt.

If you haven't done the above, do that first. It's free.

Step one: do you have a knowledge corpus the model doesn't?

This is the RAG question, properly framed.

You need RAG when:

  • There's a body of documents, data, or knowledge the model wasn't trained on.
  • It's specific to your domain, customer, or product.
  • It's too large to fit in a single context window.
  • It changes over time, so you can't just stuff it in the system prompt.

You don't need RAG when:

  • The information is general enough that a frontier model already knows it (you'd be surprised what they know).
  • The corpus is small enough that it fits in context. Just paste it.
  • The "knowledge" is actually a rule set. Encode it as code or as a prompt section.

For most Indian SMB engagements, "do we need RAG" reduces to "do we have meaningful internal documents that aren't already represented in the database". Often the answer is no. The data is in Tally, in Postgres, in a spreadsheet. SQL or a tool call beats a vector search every time for structured data.

Step two: when RAG is the answer, do it properly

If RAG is the answer, the default in 2026 is hybrid retrieval (vector + keyword), reranking, and chunk hygiene. Pure vector RAG was a 2023 starter pattern and it stopped being state of the art a long time ago.

The minimum viable production RAG stack:

  • Chunking that respects structure. Headings, paragraphs, code blocks. Not naive 512-token sliding window. Bad chunking destroys retrieval quality more than any other single factor.
  • Embeddings from a current model. OpenAI's text-embedding-3-large or the equivalent.
  • Vector store matched to your scale. pgvector for most cases, Pinecone or Qdrant if you've outgrown it.
  • Keyword search alongside vector. BM25 over the same corpus. Hybrid retrieval consistently beats pure vector.
  • Reranker on top. Cohere's reranker or an open equivalent. This step alone often improves answer quality by 20-30%.
  • Citations. Every answer ties back to the chunks it used. Without citations, you cannot debug, and you cannot trust.

If your RAG doesn't have these pieces, the problem with your RAG is probably the pipeline, not the model.

The mistake we keep seeing: teams blame the LLM when retrieval is failing. They swap models, they try fine-tuning. The fix was a chunking change and a reranker. Always look at retrieval first.

Step three: when, and only when, fine-tuning earns its place

Fine-tuning a small model used to be the prestige move. In 2026, it's a specialist tool with a narrow but real role.

Fine-tuning is the right answer when:

  • Latency matters and a frontier API can't meet your budget. A fine-tuned small model running on your infrastructure can answer in 50ms where an API round-trip is 800ms. Real-time use cases, voice, transcription, on-device.
  • Cost is dominated by inference, not training. You're running millions of calls a month on a narrow task. Even a 5x reduction in per-call cost from a fine-tuned 8B model justifies the training run.
  • You have proprietary data and a tight task definition. Classification, extraction, structured generation on a specific domain. Generic models can do it; a fine-tuned model can do it better and faster, sometimes.
  • You need on-prem. Compliance, data residency, internal policy. Fine-tuning a self-hosted model becomes a real option when an API call isn't.

Fine-tuning is the wrong answer when:

  • The task is open-ended reasoning. Frontier models still beat fine-tuned smaller models on hard reasoning by a clear margin. Apple's research on small on-device models is excellent and explicitly does not claim to match Opus-class reasoning.
  • You have a few hundred examples. That's a prompting problem, not a training problem.
  • Your data changes weekly. Retraining loops are expensive operationally. RAG is more flexible.
  • "Hallucinations" are the symptom. Fine-tuning does not reliably reduce hallucinations. Better prompting, better retrieval, and structured outputs do.

A practical decision tree, in order

Walk it top-down. Stop at the first answer.

  1. Better prompts first. If you haven't versioned a careful prompt with examples and tested it, do that. Most "we need X" goes away here.
  2. More context. Modern models take huge contexts. If your corpus fits, just pass it. Cache the prompt.
  3. Tool use. If the answer lives in a database, API, or file, give the model a tool. Don't embed structured data.
  4. RAG, done well. Hybrid, reranked, cited. Only when (1)-(3) genuinely don't fit.
  5. Fine-tune a small model. Only when latency, cost, or sovereignty constraints force it, with a clear task definition and enough data.
  6. Train from scratch. Almost never. If you're considering this, you have a research project, not a product.

A worked example

A client wanted to "fine-tune a model on our product catalog so it can answer customer questions". 12,000 products, descriptions, specifications, FAQs.

Walking the tree:

  • Step 1: nobody had a careful prompt. We wrote one. Answer quality went from 60% to 75% on a small eval.
  • Step 2: 12,000 products didn't fit in context, but the categories did. We stuffed the category structure in the system prompt.
  • Step 3: gave the model a tool to look up a specific product by ID or fuzzy name. Now it could answer specific questions.
  • Step 4: built a small hybrid RAG over the descriptions for "find me a product like X" queries. With reranking.
  • Step 5 and 6: didn't need them.

Total build: about a week. Answer quality on the eval went from 60% to 92%. Fine-tuning was never the right answer.

This isn't a unique case. It's the most common case.

The honest summary

Fine-tuning is glamorous. Prompts and retrieval are boring. The boring stuff covers 80-90% of business AI work in 2026. The glamour cases exist (on-device, real-time voice, regulated data), but they are the minority.

Before reaching for the next letter in the acronym soup, walk the tree. Stop at the first honest answer. You'll ship faster and pay less.

The first question to ask about any model problem is "have we actually prompted it properly". Most of the time, the answer is no.

Tags

  • rag
  • fine-tuning
  • prompts
  • llm
  • decisions