Prompt engineering for production, not the chat window

Most prompt guides optimise for one-shot demos. Production prompts handle ambiguous input, partial failure, JSON edge cases, length limits, and tool-call recovery.

08 May 2026 · 9 min read · Krypto Forge

Most "prompt engineering" content is about getting a clever answer once. Production prompts are about getting a correct, parseable answer ten thousand times in a row, including when the input is malformed, the API rate-limits, the user typed in Hindi instead of English, and the response is too long. That's a different discipline. This is what we use.

The shape of a production prompt

A prompt in production is a structured object, not a paragraph. We treat it like config. Versioned in git, reviewed in PR, tested in CI.

The skeleton:

[ROLE]
[SCOPE]
[INSTRUCTIONS]
[EXAMPLES]
[TOOLS]
[OUTPUT FORMAT]
[GUARDRAILS]
[USER TURN]

Each section does one job. Role and scope establish who the model is and what it's allowed to do. Instructions are the task. Examples are a few worked cases. Tools are the available actions. Output format is the schema. Guardrails are the failure modes. User turn is what changes per request.

A real (compressed) example from a customer-ops agent:

ROLE: You are a textile order intake assistant for Paraslace.
SCOPE: You read customer WhatsApp messages, identify the order, and draft a structured intake. You do not send messages. You do not commit orders. You produce a JSON object only.

INSTRUCTIONS:
- Identify customer (use the customer_id from the metadata).
- Extract items, quantities, units, delivery date.
- Quantities and units must be normalised to meters or pieces.
- Delivery date in ISO format. If ambiguous, prefer the next month-end.
- If you cannot determine an item, set its `unknown` field to true and add a clarification question.

EXAMPLES:
[3 worked examples, including one with a voice note transcription and one in Hindi]

OUTPUT FORMAT:
{
  "customer_id": "...",
  "items": [{ "name": "...", "qty": <number>, "unit": "m" | "pcs", "unknown": <boolean> }],
  "delivery_date": "YYYY-MM-DD",
  "clarifications_needed": ["..."]
}

GUARDRAILS:
- If the message has no order intent, return { "no_order_intent": true }.
- If the inferred order amount is over ₹50,000, set `requires_review: true`.
- Never return text outside the JSON.

That's the shape. Nothing clever. Just every part of the contract written down.
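One way to make "a structured object, not a paragraph" concrete is to hold each section as a named field and render the system prompt from it. The sketch below is an illustration under assumptions, not our actual loader: `PromptSpec`, `SECTION_ORDER`, and `render` are hypothetical names, and the section bodies are shortened from the example above.

```python
from dataclasses import dataclass

# Section order mirrors the skeleton; the user turn is sent separately per request.
SECTION_ORDER = ["role", "scope", "instructions", "examples", "tools",
                 "output_format", "guardrails"]


@dataclass(frozen=True)
class PromptSpec:
    """Hypothetical container: one field per section of the skeleton."""
    version: str
    role: str
    scope: str
    instructions: str
    output_format: str
    guardrails: str
    examples: str = ""
    tools: str = ""

    def render(self) -> str:
        """Join the non-empty sections into one system prompt, in fixed order."""
        parts = []
        for name in SECTION_ORDER:
            body = getattr(self, name).strip()
            if body:
                label = name.replace("_", " ").upper()
                parts.append(f"{label}:\n{body}")
        return "\n\n".join(parts)


spec = PromptSpec(
    version="order_intake_v7",
    role="You are a textile order intake assistant for Paraslace.",
    scope="You produce a JSON object only.",
    instructions="- Extract items, quantities, units, delivery date.",
    output_format='{"customer_id": "...", "items": []}',
    guardrails="- Never return text outside the JSON.",
)
system_prompt = spec.render()
```

Keeping sections as fields means a PR diff shows which part of the contract changed, and an empty section (here, tools) simply drops out of the rendered text.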

The four things that go wrong in production

In order of how often we see them.

One: malformed JSON. The model adds a chatty preamble. Or a trailing comma. Or breaks on a deeply nested structure. Fix: always use the API's structured output mode (response_format, JSON mode, tool calls). Validate with a JSON schema on receipt. If parsing fails, retry once with a "your previous response was not valid JSON, here it is, fix it" prompt.
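The parse, validate, retry-once flow might look like the sketch below. It assumes a hypothetical `call_model` wrapper (prompt text in, raw model text out) and uses a hand-written structural check as a stand-in for a real JSON Schema validator; the repair wording is paraphrased from the fix described above.

```python
import json
from typing import Callable

REPAIR_TEMPLATE = (
    "Your previous response was not valid JSON. Here it is:\n\n{bad}\n\n"
    "Return only the corrected JSON object."
)


def validate_intake(obj: object) -> dict:
    """Minimal structural check; a JSON Schema validator would replace this."""
    if not isinstance(obj, dict):
        raise ValueError("top-level value must be an object")
    if obj.get("no_order_intent") is True:
        return obj
    for key in ("customer_id", "items", "clarifications_needed"):
        if key not in obj:
            raise ValueError(f"missing key: {key}")
    if not isinstance(obj["items"], list):
        raise ValueError("items must be a list")
    return obj


def parse_with_one_retry(call_model: Callable[[str], str], user_message: str) -> dict:
    """call_model is a hypothetical wrapper around the actual API call."""
    raw = call_model(user_message)
    try:
        return validate_intake(json.loads(raw))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        repaired = call_model(REPAIR_TEMPLATE.format(bad=raw))
        # A second failure propagates to the caller instead of looping.
        return validate_intake(json.loads(repaired))
```

Letting the second failure raise is deliberate: one repair attempt recovers most formatting slips, and anything beyond that is better handled by a fallback path than by more retries.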

Two: refusals on legitimate input. The model refuses a perfectly normal request because something in the wording tripped a safety filter. Fix: pre-screen with a cheap model if your domain has known trigger words, or rephrase the prompt to make the legitimate intent obvious. Log refusals separately so you can spot patterns.

Three: length blow-out. The model produces a 4000-token response when you wanted 100. Fix: set max_tokens aggressively. Specify length in the prompt ("respond in 1-2 sentences"). Truncate gracefully on the client side.
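"Truncate gracefully" can mean different things; one reading is to cut at the last full sentence that fits and only fall back to a word boundary when there is none. The helper below is a sketch under that assumption, with `truncate_gracefully` as a hypothetical name and a character limit standing in for whatever budget the client uses.

```python
def truncate_gracefully(text: str, max_chars: int, ellipsis: str = "…") -> str:
    """Cut at the last sentence end that fits, else at the last word boundary.

    In the fallback case the ellipsis is appended, so the result may be one
    character longer than max_chars.
    """
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    # Prefer ending on a complete sentence.
    cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
    if cut != -1:
        return window[: cut + 1]
    # Otherwise fall back to a word boundary and mark the cut.
    cut = window.rfind(" ")
    if cut == -1:
        cut = max_chars
    return window[:cut].rstrip() + ellipsis
```

It is also worth checking whether the response was cut by the token cap in the first place (Anthropic's Messages API reports this as a `stop_reason` of `max_tokens`), since a capped response to a structured-output prompt is usually unparseable rather than merely long.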

Four: tool-call loops. The agent calls a tool, gets an error, retries, gets the same error, retries again. Fix: cap retries per tool per turn. After two failures, force the agent to summarise the situation to a human. We've covered this in the agentic AI post.
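A per-tool, per-turn retry cap can be sketched as a small budget object. This is a simplification under assumptions: in a real agent loop the model decides whether to call the tool again and the budget only gates that decision, whereas here the retry is inlined. `ToolRetryBudget`, `run_tool_with_budget`, and the `escalate_to_human` status are hypothetical names.

```python
from collections import Counter


class ToolRetryBudget:
    """Tracks failures per tool within one turn; create a new one each turn."""

    def __init__(self, max_failures: int = 2) -> None:
        self.max_failures = max_failures
        self.failures = Counter()

    def record_failure(self, tool_name: str) -> None:
        self.failures[tool_name] += 1

    def exhausted(self, tool_name: str) -> bool:
        return self.failures[tool_name] >= self.max_failures


def run_tool_with_budget(tool_name, tool_fn, args, budget):
    """Call a tool until it succeeds or its budget runs out, then escalate."""
    last_error = None
    while not budget.exhausted(tool_name):
        try:
            return {"status": "ok", "result": tool_fn(**args)}
        except Exception as exc:  # broad on purpose: any tool error counts
            budget.record_failure(tool_name)
            last_error = str(exc)
    # Budget spent: stop retrying and hand the situation to a human.
    return {"status": "escalate_to_human", "tool": tool_name, "last_error": last_error}
```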

Examples beat instructions

Few-shot is the single most underused tool. Three good examples typically outperform a paragraph of instructions on the same task. Models are pattern-matchers, and examples are the most concentrated form of a pattern.

Three rules we use for examples:

Cover the boundary cases. One example of the easy case, one of the hard case, one of the failure case.
Use realistic data. Not "John Doe". Use the kind of names, dates, and quirks your real users produce.
Show the exact output format. If you want JSON, the example output is JSON. Don't describe it. Show it.
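The three rules translate fairly directly into how few-shot turns are assembled. The sketch below assumes a chat-style API that takes alternating user and assistant messages; the example messages and IDs are invented for illustration (they are not real customer data), and `build_messages` is a hypothetical helper.

```python
import json

# Invented examples: one easy case, one hard case, one failure case.
EXAMPLES = [
    ("Need 200 m white cotton lace by 15 June",
     {"customer_id": "C-1042",
      "items": [{"name": "white cotton lace", "qty": 200, "unit": "m", "unknown": False}],
      "delivery_date": "2026-06-15",
      "clarifications_needed": []}),
    ("send the same border as last time, 50 pcs, by 30 June",
     {"customer_id": "C-1042",
      "items": [{"name": "unknown", "qty": 50, "unit": "pcs", "unknown": True}],
      "delivery_date": "2026-06-30",
      "clarifications_needed": ["Which border design was in your last order?"]}),
    ("Thanks, payment sent", {"no_order_intent": True}),
]


def build_messages(examples, user_message):
    """Interleave examples as user/assistant turns, then append the real input."""
    messages = []
    for text, expected in examples:
        messages.append({"role": "user", "content": text})
        # The assistant turn is the exact JSON we want back, not a description.
        messages.append({"role": "assistant",
                         "content": json.dumps(expected, ensure_ascii=False)})
    messages.append({"role": "user", "content": user_message})
    return messages


messages = build_messages(EXAMPLES, "100 pcs red tassel, deliver 20 July")
```

Serialising the expected output with `json.dumps` rather than typing it by hand means every example is guaranteed to be valid JSON in the shape the parser expects.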

When we add examples to a flaky prompt, the failure rate often drops from 10-15% to under 2% without changing anything else.

Versioning is non-negotiable

Every production prompt in our codebase lives in a file with a version. Changes go through PR review. The version is logged on every API call alongside model name and parameters.

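import logging
from pathlib import Path

import anthropic

PROMPT_VERSION = "order_intake_v7"
# Prompt files sit in git next to the code; the prompts/ path is illustrative.
SYSTEM_PROMPT = Path("prompts", f"{PROMPT_VERSION}.md").read_text(encoding="utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
result = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": user_message}],
)
# Log the version ourselves: the API's metadata field is meant for an end-user ID.
logging.info("llm_call prompt_version=%s model=%s", PROMPT_VERSION, result.model)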

Two reasons. One, when behaviour changes in production on a Tuesday, you can attribute it to a specific change. Two, when you want to A/B a new prompt version, you have an honest comparison.

Without versioning, "the model is acting weird" is a debugging session that can take days. With versioning, it's a git log.

Treat prompts the way you treat schemas. Migrations have versions. So do prompts. The day you don't have this is the day you'll need it.

Eval before deploy

A prompt change without an eval is a regression waiting to happen. The eval doesn't have to be fancy. A spreadsheet of 30-50 representative cases with expected outputs gets you 80% of the value.

Our setup, simplified:

An evals/ folder with a few JSONL files. Each line is { "input": ..., "expected": ... }.
A test runner that calls the prompt against each input, scores the output against expected (sometimes with another LLM as judge, usually with rule-based checks).
The runner outputs a pass rate per case category.

Any prompt change has to either improve or hold the eval. If it regresses on a category, we know to look closer.
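A runner along these lines fits in a few dozen lines. The sketch below is an assumption-laden illustration rather than our actual runner: it scores by exact match only (no LLM judge), `run_prompt` is a hypothetical callable that wraps the prompt under test, and the per-line `category` field is an assumed addition to the `{ "input": ..., "expected": ... }` shape.

```python
import json
from collections import defaultdict
from typing import Callable


def load_cases(jsonl_text: str) -> list[dict]:
    """Parse JSONL: one case per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def run_eval(cases: list[dict], run_prompt: Callable[[str], dict]) -> dict[str, float]:
    """Return the pass rate per category using an exact-match rule check."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for case in cases:
        category = case.get("category", "uncategorised")  # assumed extra field
        total[category] += 1
        try:
            ok = run_prompt(case["input"]) == case["expected"]
        except Exception:
            ok = False  # a crash counts as a failed case
        passed[category] += ok
    return {cat: passed[cat] / total[cat] for cat in total}


def regressed_categories(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Categories whose pass rate dropped below the baseline (or disappeared)."""
    return sorted(c for c in baseline if current.get(c, 0.0) < baseline[c])
```

In CI, the "improve or hold" rule then reduces to failing the build when `regressed_categories` returns a non-empty list for the new prompt version.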

We don't claim our evals are exhaustive. We claim they catch the obvious regressions, and that's enough to ship without anxiety.

A few smaller habits

Things we do that have paid off, in no particular order:

System prompts come first, user content last. This maximises prompt cache hits when the API supports it (Anthropic's caching is the obvious one).
Plain markdown over fancy XML. Both work. Markdown is easier to read in PR review and the model handles it as well or better.
One prompt per task. Don't try to make one giant prompt do five jobs. Separate prompts for separate tasks scale better.
Negative examples sparingly. "Don't do X" is sometimes counterproductive. Better to say what to do, with examples.
Force confidence calibration when it matters. "If you're not sure, return confidence: low". Then route low-confidence to a human.
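The last habit only pays off if something downstream acts on the confidence field. A minimal routing sketch, assuming the model's JSON carries a `confidence` string and optionally the `requires_review` flag from the earlier example (the `route` function and queue names are hypothetical):

```python
def route(result: dict) -> str:
    """Decide where a parsed model result goes next.

    A missing confidence field is treated as low, so a prompt that forgets
    to emit it fails towards human review rather than auto-processing.
    """
    if result.get("requires_review") is True:
        return "human_review"
    if result.get("confidence", "low") == "low":
        return "human_review"
    return "auto"
```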

The honest takeaway

Production prompt engineering is mostly software engineering applied to natural language. Versioning, schemas, evals, fallback paths, observability. The "creative" bit is a smaller fraction of the work than the demos suggest.

If you're shipping LLM features and your prompts live in a Google Doc, that's where to start. Move them into git. Write three examples. Run them through an eval. The first time something changes and you can debug it in five minutes instead of five hours, the discipline pays for itself.

A prompt is a contract. Treat it like one.