What Most GenAI Evaluation Workflows Get Wrong
Reliability is not a property of the final answer. It is a property of the whole system that produced it.
I keep seeing the same mistake. A team hooks up a model, eyeballs a handful of outputs, decides it looks fine, and ships. Evaluation only enters the picture after something breaks in production, and by then the team is firefighting, bolting on test cases one incident at a time, always a step behind.
The issue is not that they forgot to evaluate. It is that they put the eval in the wrong place. They scored the final answer without looking at what produced it. But in any system that retrieves documents, routes intents, calls tools, or chains reasoning steps, the answer is the last place a bug shows up—and almost never where it starts. A bad retrieval, a misrouted query, or a wrong tool argument can all produce something that reads beautifully and is completely wrong.
Evaluation belongs inside the workflow, not after it. GenAI evals should look less like one big pass/fail test and more like layered quality gates.
This is not just my opinion. Guidance from OpenAI, Anthropic, Google Cloud, Microsoft Foundry, LangSmith, and NIST points the same way: evaluate routing, retrieval, tool use, grounding, safety, and drift, not just the final paragraph.
Why the Final Answer Is Usually Not the Root Cause
When something goes wrong, the reflex is to blame the model. But in most real-world setups, the model is just one piece of a longer pipeline. There are at least five places where things can quietly go sideways before the answer even gets generated:
• Prompt construction and intent routing
• Document retrieval and context assembly
• Tool selection and parameter extraction
• Agent handoff and orchestration logic
• Post-processing and output formatting
The major platforms now say much the same thing. OpenAI recommends scoped tests at every stage. Microsoft says you need to evaluate each step, not just the final output. Google separates response evaluation from trajectory evaluation. Ragas breaks RAG scoring into retrieval metrics and generation metrics. The consensus is clear.
Outcome-only evaluation is a scoreboard. Workflow-native evaluation is an instrument panel.
End-to-end success rates still matter as the top-level KPI—but they are terrible debugging tools. Anthropic puts it well: agent mistakes compound, so you need to inspect every intermediate step to figure out whether the failure came from the model, the tooling, the harness, or the eval itself.
Where Evaluation Actually Belongs
The most useful eval setups start by breaking the system into stages and writing checks that match what each stage is supposed to do. Here are the five checkpoints that matter:
• Input & Routing — Did we understand the user’s intent? Did we catch injection attempts? Did we route to the right handler?
• Retrieval & Context — Did we pull the right documents? Are they relevant? Is the context complete enough to answer the question?
• Tool Use & Planning — Did the agent pick the right tool? Are the arguments valid? Did it follow a reasonable path to get there?
• Generation — Is the answer correct, complete, and grounded in the evidence? Is it safe? Does it know when to say “I don’t know”?
• Production Monitoring — How fast is it? How much does it cost? Is quality drifting? Are users flagging problems?

If this sounds like a lot of work, it does not have to be, at least not on day one. The practical move is risk-weighted decomposition. Pick the two or three user journeys that matter most, write clear success criteria for each, and start with a small curated dataset. Anthropic, OpenAI, and LangSmith all recommend beginning with 10-20 high-quality examples and growing from real failures. That keeps things lightweight while still making problems visible.
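To make the decomposition concrete, here is a minimal sketch of one curated example carrying an expectation for each stage, plus a helper that reports the earliest stage that failed. Everything in it (the `Example` and `Trace` fields, the route names, the document IDs) is a hypothetical shape chosen for illustration, not the schema of any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """One curated test case with an expectation for each stage."""
    question: str
    expected_route: str
    expected_doc_ids: set[str]
    expected_tool: Optional[str]
    must_contain: str  # phrase a correct answer has to include


@dataclass
class Trace:
    """What the system actually did for one request."""
    route: str
    retrieved_doc_ids: list[str]
    tool: Optional[str]
    answer: str


def stage_results(example: Example, trace: Trace) -> dict[str, bool]:
    """Run one check per stage, in pipeline order."""
    return {
        "routing": trace.route == example.expected_route,
        "retrieval": example.expected_doc_ids <= set(trace.retrieved_doc_ids),
        "tool_use": trace.tool == example.expected_tool,
        "generation": example.must_contain.lower() in trace.answer.lower(),
    }


def first_failure(results: dict[str, bool]) -> Optional[str]:
    """Return the earliest stage that failed, or None if all passed."""
    for stage, ok in results.items():
        if not ok:
            return stage
    return None


example = Example(
    question="Can I get a refund after 30 days?",
    expected_route="refunds",
    expected_doc_ids={"policy-refunds-v3"},
    expected_tool=None,
    must_contain="30 days",
)
trace = Trace(
    route="refunds",
    retrieved_doc_ids=["policy-shipping-v1"],  # wrong document retrieved
    tool=None,
    answer="Refunds are available within 30 days of purchase.",
)
results = stage_results(example, trace)
print(results)
print("first failing stage:", first_failure(results))
```

In this toy trace the generation check passes while the retrieval check fails, which is exactly the case an answer-only score would miss.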
Picking Metrics That Match How Things Actually Break
The golden rule: use the cheapest reliable check first. Start with deterministic graders: exact match, schema validation, regex, policy checks, citation lookups. Only bring in LLM judges when you need to score something fuzzy like helpfulness or completeness.
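As a rough illustration of what "cheap and deterministic" can mean, the sketch below implements three such graders with only the Python standard library. The function names and the key-and-type style of schema check are assumptions made for this example; a real project might use a full JSON Schema validator instead.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Compare after trimming whitespace and ignoring case."""
    return output.strip().casefold() == expected.strip().casefold()


def matches_pattern(output: str, pattern: str) -> bool:
    """True if the regular expression is found anywhere in the output."""
    return re.search(pattern, output) is not None


def valid_json_with_keys(output: str, required: dict[str, type]) -> bool:
    """Check that the output is a JSON object with the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())


print(exact_match("  Paris ", "paris"))
print(matches_pattern("Order #A-1042 has shipped", r"#A-\d{4}\b"))
print(valid_json_with_keys('{"tool": "refund", "amount": 20}', {"tool": str, "amount": int}))
print(valid_json_with_keys('{"tool": "refund"}', {"tool": str, "amount": int}))
```

Checks like these cost almost nothing per run, so they can gate every example before any LLM judge is called.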
What to track depends on your system type:
RAG systems: context precision, context recall, grounding, faithfulness. Track retrieval and generation as separate scores.
Agent systems: tool selection accuracy, argument validity, task completion rate, and trajectory precision and recall.
Any system: correctness, completeness, safety, abstention quality. Keep these as separate dimensions—one overall number hides too much.
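One way to keep these dimensions separate is to compute each as its own number and report them side by side. The sketch below uses simplified set-based precision and recall for both retrieved documents and tool trajectories. This is an assumption for illustration: libraries such as Ragas define their metrics differently (for example, rank-aware or judge-based), and the document IDs and tool names are made up.

```python
def precision(predicted: list[str], reference: set[str]) -> float:
    """Share of predicted items that appear in the reference set."""
    if not predicted:
        return 0.0
    return sum(item in reference for item in predicted) / len(predicted)


def recall(predicted: list[str], reference: set[str]) -> float:
    """Share of reference items that were actually predicted."""
    if not reference:
        return 1.0
    return len(set(predicted) & reference) / len(reference)


def scorecard(
    retrieved_docs: list[str],
    relevant_docs: set[str],
    called_tools: list[str],
    expected_tools: set[str],
) -> dict[str, float]:
    """Report retrieval and trajectory quality as separate scores."""
    return {
        "context_precision": precision(retrieved_docs, relevant_docs),
        "context_recall": recall(retrieved_docs, relevant_docs),
        "trajectory_precision": precision(called_tools, expected_tools),
        "trajectory_recall": recall(called_tools, expected_tools),
    }


scores = scorecard(
    retrieved_docs=["d1", "d4", "d7", "d2"],
    relevant_docs={"d1", "d2", "d3"},
    called_tools=["lookup_order", "issue_refund"],
    expected_tools={"lookup_order", "check_policy", "issue_refund"},
)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Here the agent called only valid tools (trajectory precision 1.00) but skipped the policy check (trajectory recall 0.67), a gap that a single blended score would blur.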
A warning about LLM-as-judge: yes, LLM judges can scale evaluation and hit human-level agreement on some tasks. But research (MT-Bench, the “Large Language Models are not Fair Evaluators” paper) shows they are prone to position bias, verbosity bias, and self-enhancement bias. Just swapping the order of two responses can flip the ranking. The fix is not to avoid LLM judges; it is to calibrate them. Use pass/fail rubrics with clear criteria, run pairwise comparisons in both orders so a position-driven verdict shows up as a disagreement, and periodically check agreement against human labels. Targeted human calibration beats full human scoring.
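A minimal sketch of an order-swap check for pairwise judging follows. The `Judge` signature is hypothetical: it stands in for whatever function wraps your LLM judge and is assumed to return `"first"` or `"second"`. The two toy judges exist only to show how the check behaves.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> str:
    """Ask the judge twice with the answers swapped; trust only consistent verdicts."""
    a_wins_forward = judge(question, a, b) == "first"
    a_wins_backward = judge(question, b, a) == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "inconsistent"  # verdict flipped with the order: likely position bias


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    """Toy judge that is order-consistent but biased toward longer answers."""
    return "first" if len(first) >= len(second) else "second"


question = "What is the refund window?"
a, b = "30 days.", "Refunds are accepted within 30 days of purchase."
print(debiased_preference(always_first, question, a, b))
print(debiased_preference(longer_wins, question, a, b))
```

The swap catches the position-biased judge but not the verbosity-biased one, which is why periodic comparison against human labels is still needed.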
A Six-Step Playbook for CI and Production
Evals that live only in notebooks are evals that rot. Here is how to make them part of your development and deployment loop:
1. Instrument traces early. From the first real prototype, capture everything: input, system prompt, routing decision, retrieved docs, tool calls, results, final answer, latency, cost, feedback. Without traces, you can evaluate only the output, not the workflow.
2. Start with a small, sharp dataset. 10-20 curated examples per critical path. Include happy paths, edge cases, known failure modes, and adversarial inputs. Quality beats quantity here.
3. Layer your graders. Deterministic checks first (schema validation, regex, policy filters). LLM judges second (for nuanced criteria like completeness). Sampled human review third (for calibration and high-stakes cases).
4. Wire evals into CI/CD. When prompts, retrievers, tool schemas, or orchestration logic change, run component-level and end-to-end evals automatically. Fail the build if pass rates drop.
5. Add online evaluation after deploy. Score a sample of production traces for quality, safety, and cost. Route low scores and negative feedback into annotation queues. Feed confirmed failures back into your offline dataset.
6. Schedule security scans. Red-team prompts, poisoned-context tests, and injection probes should run on a regular cadence, not just at deploy time.
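The "fail the build if pass rates drop" step can be sketched as a small gate that compares per-stage pass rates against a stored baseline. The stage names, baseline values, and the 0.02 tolerance below are illustrative assumptions, not recommended thresholds.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of examples that passed; an empty run counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def regressions(
    current: dict[str, list[bool]],
    baseline: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, float]]:
    """Return every stage whose pass rate fell below baseline minus tolerance."""
    failed = {}
    for stage, floor in baseline.items():
        rate = pass_rate(current.get(stage, []))
        if rate < floor - tolerance:
            failed[stage] = (rate, floor)
    return failed


def gate(current: dict[str, list[bool]], baseline: dict[str, float]) -> int:
    """Print regressions and return a process exit code (0 = pass, 1 = fail)."""
    failed = regressions(current, baseline)
    for stage, (rate, floor) in failed.items():
        print(f"FAIL {stage}: pass rate {rate:.2f} below baseline {floor:.2f}")
    return 1 if failed else 0


# Hypothetical results from one eval run, keyed by stage.
current = {
    "routing": [True] * 10,
    "retrieval": [True] * 7 + [False] * 3,
    "generation": [True] * 9 + [False],
}
baseline = {"routing": 0.95, "retrieval": 0.85, "generation": 0.90}
exit_code = gate(current, baseline)
print("exit code:", exit_code)
```

In a CI job, the script would end with `sys.exit(exit_code)` so that a non-zero code fails the build. Keeping the gate per stage means a retrieval regression is reported as such instead of disappearing into an end-to-end average.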
This is the pattern you will find across LangSmith, MLflow, Microsoft Foundry, and Promptfoo. It is where evaluation stops being a one-off check and becomes a continuous part of how the system runs.

What Happens When Evals Only Check the Surface
Air Canada’s chatbot (Moffatt v. Air Canada). A customer asked about bereavement fares and got incorrect guidance from the airline’s chatbot. Air Canada tried to distance itself from the bot’s answer, but the tribunal was not having it—the chatbot was part of the company’s website, and the airline was responsible for what it said.
This was not a hallucination problem. It was a missing-controls problem. A workflow-level eval would have checked chatbot responses against the actual policy documents, flagged contradictions, required evidence for refund-related answers, and routed anything ambiguous to a human. A style-level review of the output would have missed all of this—the answer probably read just fine.
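Those controls can be pictured as a gate that runs before an answer is delivered. The sketch below is deliberately simplified: the topic labels are hypothetical, and the `contradicts_policy` flag is assumed to come from an upstream check against the policy documents, which is the hard part and is not shown here.

```python
# Hypothetical topics where an unsupported answer is too risky to send.
HIGH_STAKES_TOPICS = {"refund", "bereavement_fare", "cancellation"}


def delivery_decision(topic: str, evidence_ids: list[str], contradicts_policy: bool) -> str:
    """Decide whether a drafted answer may be sent to the user."""
    if contradicts_policy:
        return "block"  # the draft disagrees with the source of truth
    if topic in HIGH_STAKES_TOPICS and not evidence_ids:
        return "escalate_to_human"  # no supporting policy passage was found
    return "deliver"


print(delivery_decision("bereavement_fare", [], contradicts_policy=False))
print(delivery_decision("refund", ["policy-refunds-v3"], contradicts_policy=True))
print(delivery_decision("baggage_size", [], contradicts_policy=False))
```

An eval suite can then assert on the decision for each curated example, independent of how fluent the drafted answer is.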
Fabricated case law (Mata v. Avianca). Lawyers submitted court filings with fake case citations generated by ChatGPT. The court sanctioned them, ordered letters of explanation to the judges who were falsely cited, and imposed a $5,000 fine.
The output sounded professional. The problem is that nobody checked whether the cited cases actually existed. For any system that conducts research or cites sources, evaluation needs to cover source resolution, citation validity, and handling missing evidence. A confident answer with fabricated support should count as a failure, even if it scores well on fluency.
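A citation-validity check can be very small once citations are available as structured values. In the sketch below, `resolve` is a hypothetical lookup that returns True only when a citation exists in a trusted index, and the case identifiers are placeholders, not real cases.

```python
from typing import Callable, Iterable


def unresolved_citations(citations: Iterable[str], resolve: Callable[[str], bool]) -> list[str]:
    """Return every citation the trusted index could not resolve."""
    return [citation for citation in citations if not resolve(citation)]


def grade_citations(citations: Iterable[str], resolve: Callable[[str], bool]) -> dict:
    """Fail when any citation is unresolved, or when no citation is given at all."""
    cited = list(citations)
    missing = unresolved_citations(cited, resolve)
    return {"passed": bool(cited) and not missing, "unresolved": missing}


# Placeholder index standing in for a real case-law database lookup.
KNOWN_CASES = {"case-001", "case-002"}


def in_index(citation: str) -> bool:
    return citation in KNOWN_CASES


print(grade_citations(["case-001", "case-999"], in_index))
print(grade_citations(["case-001", "case-002"], in_index))
print(grade_citations([], in_index))
```

Treating an empty citation list as a failure is a design choice for research-style answers, where missing evidence should lead to an abstention, not a confident reply.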
Conclusion
Better prompts and better models help. But what actually makes GenAI systems reliable is treating evaluation as part of the workflow, not something you bolt on at the end. That means checkpoints before generation, around tool use, before delivery, and after deployment through tracing and feedback.
The playbook: pick the two or three user journeys that matter most. Define what success looks like. Build 10-20 test cases for each. Use cheap graders first. Wire evals into CI. Sample production traces. The tools already exist.
The remaining question is simple: does your team treat evaluation as something the system does all the time, or something someone does to the system after the fact?



