Engineering

The cost-aware LLM pipeline: when to use Haiku, Sonnet, Opus

70% of LLM cost is wasted on calls that didn't need the smartest model. A working pattern for routing prompts to the right tier, with caching and graceful degradation.

11 May 20269 min readKrypto Forge

Most LLM bills are 3-5x what they need to be. The pattern is consistent: somebody picked the smartest model during a prototype, the prototype shipped, and the smart model is now answering questions that a cheap model would have nailed. The fix is not "use a cheap model". The fix is a router and a few habits.

This is the pipeline shape we use across our agent work and the one we recommend to clients before they panic about their Anthropic bill.

The model tiers, as of mid-2026

The Anthropic line-up is the cleanest example. Haiku, Sonnet, Opus. Three tiers with order-of-magnitude price differences and very different speed profiles. OpenAI, Google, Mistral each have an equivalent shape.

Roughly, at the time of writing:

  • Haiku 4.5. Cheap, fast, good for classification, extraction, simple summarisation, routing.
  • Sonnet 4.6. The workhorse. Strong reasoning, good tool use, code generation, document understanding. The default for production.
  • Opus 4.7. Deep reasoning. Multi-step planning, hard math, ambiguous judgement. Significantly more expensive and slower.

The price ratios matter more than the absolute numbers. Haiku is roughly a tenth the cost of Sonnet for the same input. Sonnet is roughly a fifth the cost of Opus. So a workflow that runs entirely on Opus will cost 50x what the same workflow could cost if you routed well.

The 70% claim

We don't have a citation for this so we'll be honest. In our own pipelines, before we paid attention, somewhere between 60% and 80% of calls were going to a tier above what they needed. Across roughly a dozen production agent stacks we've reviewed, the same was true. The 70% number is a working average from our own data, not a benchmark.

The reason it happens is mundane. Prototypes start on the smartest model because you want to find the ceiling of what's possible. Then everyone ships before circling back to optimise. The cheap model gets tried later, if at all.

The router pattern

The shape we use is a triage step at the front of every pipeline.

A Haiku-class model receives the user input, decides what kind of task this is, and routes to the right tier. The router prompt is small, cheap, fast, and only has to be roughly right.

def route(user_input: str) -> str:
    decision = haiku.complete(
        system=ROUTER_PROMPT,  # ~200 tokens
        user=user_input,
        max_tokens=20,
        response_format={"type": "json_object"},
    )
    return decision["tier"]  # "cheap" | "mid" | "deep"

if route(user_input) == "deep":
    return opus.complete(...)
elif route(user_input) == "mid":
    return sonnet.complete(...)
else:
    return haiku.complete(...)

This is roughly 20 lines of code, and it routinely cuts the bill by 60-80%. The router itself costs almost nothing because it's a Haiku call with a short prompt and a short output.

Three rules we use when building routers:

  • The router never tries to solve the task. It only classifies. Keep it stupid.
  • The router has a fallback: if the response isn't a valid JSON tier, default to the mid tier. Never fail open to the deep tier; that's how a bug eats a month's budget.
  • The router is logged separately. We sample its decisions weekly to make sure it isn't lazy-routing everything to "deep" because the prompt drifted.

Prompt caching is half the win

Anthropic's prompt caching, with a 5-minute TTL on cached blocks, is the other half of the cost story. Most production agents have a long system prompt (tool definitions, instructions, examples) and short user turns. Without caching, every turn re-pays for the full system prompt. With caching, the system prompt is paid once per cache window, and subsequent turns are cheap.

For an agent that gets used in bursts (a support agent during business hours, say), prompt caching can drop input cost by another 50-70%. Combined with routing, the total bill from the "smart model on everything" baseline can fall by an order of magnitude.

The discipline: structure your prompts so the stable, long part comes first and the variable part comes last. That way the cache key is consistent.

Graceful degradation

This is the bit nobody writes about and everyone needs. What happens when Opus rate-limits, or Sonnet times out, or the cheap tier returns garbage?

Our pattern:

  • Retry with backoff and jitter on transient failures. Standard SRE.
  • Tier-down on persistent failure. If Opus is unavailable, drop to Sonnet with a flag in the response so downstream code knows the answer is slightly worse. If Sonnet is unavailable, drop to Haiku.
  • Tier-up on quality failure. If the cheap tier returned malformed JSON twice in a row, re-route to the mid tier for that specific request, log it, sample it.
  • Hard cap. Every workflow has a max cost per execution. If it's exceeded, the workflow halts and pages a human, instead of running indefinitely.

Cost runaway is almost always a bug, not a usage spike. Set a hard cap. We've seen prompts go into infinite loops because of a malformed tool response that the model tried to "fix" by retrying. The cap saved the bill.

The numbers from one of our pipelines

A real example, anonymised. A customer-ops agent for a textile client. Before tuning:

  • All calls on Sonnet 4.6.
  • ~50,000 calls/month, average ~3,000 input tokens, ~500 output tokens.
  • Bill: about ₹52,000/month.

After routing, caching, and capping:

  • ~70% of calls on Haiku 4.5 (the triage and the simple extraction tasks).
  • ~25% on Sonnet (the actual reasoning and tool use).
  • ~5% on Opus (the genuinely hard escalations).
  • Prompt caching on the system prompt across all calls.
  • Bill: about ₹11,000/month, for the same workload.

About a 5x reduction with no measurable quality drop. The setup work was maybe two days.

What doesn't work

Three patterns we've tried and stopped using.

A "general router" that's another LLM trying to be clever. Just classify by task type. If the task type isn't well-defined, write a few examples and use them. Don't outsource the decision back to a model and pay twice.

Local models for cost reasons alone. A self-hosted Llama or Mistral can be cheap per token, but the operational cost of hosting, scaling, and matching the quality of a frontier API for typical SMB workloads usually doesn't favour local. We use local models for sovereignty reasons, not cost reasons.

Caching everything. Cache hits depend on stable prefixes. If your prompt has a per-user timestamp at the top, you're paying for caching with no benefit. Audit cache hit rates before assuming caching is helping.

The summary

Cost discipline isn't optional once you ship. The pattern is small: a cheap router, prompt caching, graceful degradation, hard caps. Apply it before you optimise the model. The router is worth more than the model choice.

If your LLM bill in 2026 is over ₹50,000/month and you haven't done this, the savings are sitting there. They're closer than you think.

The cheapest token is the one you didn't send.

Tags

  • llm
  • cost
  • claude
  • routing
  • caching