Skip to content

Pillar 9: Evaluation and Measurement

Prompts that ship in your application are production code. They drift, regress, and break when models update; test, version, and gate deployments on them like any other production system.

Prompts that ship in your application are production code. They drift, regress, and break when models update; test, version, and gate deployments on them like any other production system.

When prompts become part of a product (system prompts, prompt templates, LLM-powered features, agent instructions), they carry the same weight as any other code in your codebase. They have inputs, outputs, and expected behaviors. They can regress. They can break when models update. They can behave differently across edge cases. And unlike a conversation with a coding assistant where you can course-correct in real time, production prompts run unsupervised at scale.

The discipline of evaluation applies the same rigor to these prompts that you already apply to your code: define expected behavior, write test cases, run them systematically, and gate deployments on the results. Without evals, you cannot answer basic questions: did that prompt change make things better or just different? Are we getting consistent results across user inputs, or is output quality a lottery?

Evals are also the quality gate for everything else in this repository. Context engineering (Pillar 1) is only as good as your ability to measure whether a rules file change actually improved output. Prompt engineering (Pillar 3) without measurement is just vibes. Guardrails (Pillar 5) catch syntax and style issues, but evals catch semantic issues: did the AI understand the requirement? Did it implement the right business logic?

You treat prompts in your application as testable software

Section titled “You treat prompts in your application as testable software”

Any prompt that ships as part of a product, pipeline, or automated workflow gets a test suite. Define input scenarios (including edge cases and adversarial inputs), define expected output characteristics, and run evaluations before deploying changes. Tools like promptfoo let you define test cases, run them against multiple prompt versions or models, and compare results systematically.

This is not about testing the prompts you use in conversation with your coding assistant. Those are iterative and disposable. This is about prompts that run in production, at scale, without a human in the loop to catch mistakes.

You pick the evaluation technique that matches the question

Section titled “You pick the evaluation technique that matches the question”

"Evaluation" is not one thing, and mature eval suites combine several methods: deterministic assertions for structured output, LLM-as-judge and rubric scoring for subjective quality, pairwise comparison when "correct" is hard to define in absolute terms, behavioral or trajectory evaluation for agents (where the path matters as much as the answer), and human review on a golden dataset as the reference standard.

Each technique has different strengths and known failure modes. LLM-as-judge in particular carries position, verbosity, and self-preference bias; treat it as a filter, not ground truth, and validate judge-human agreement on a held-out set (see Pillar 6). Hamel Husain's "LLM Evals: Everything You Need to Know" is the most comprehensive practitioner walkthrough of when each method earns its place.

You integrate prompt evaluation into your CI/CD pipeline

Section titled “You integrate prompt evaluation into your CI/CD pipeline”

Just like code changes trigger test suites, prompt changes should trigger evaluation runs. Set quality thresholds that gate deployment. Track regression across model updates. When a provider ships a new model version, your eval suite tells you whether your prompts still perform before you roll it out to users.

You have criteria for "good output" before you evaluate

Section titled “You have criteria for "good output" before you evaluate”

Define what success looks like before comparing results. Common dimensions:

  • Accuracy / correctness: does it give the right answer?
  • Groundedness / faithfulness: for RAG systems, does the answer derive from retrieved context rather than parametric memory? (See Pillar 0 for RAG background.)
  • Relevance: does it address the actual question asked?
  • Format compliance: does it obey the schema or output constraints?
  • Safety: no harmful, PII-leaking, or policy-violating output.
  • Latency: track tail latency, not just the average. Tails dominate user experience.
  • Cost: tokens per call. Quality gains that come with outsized cost increases are regressions, not wins. Given the spread across provider pricing, cost must be a tracked metric.

You build your eval dataset from real signals, not fabricated ones

Section titled “You build your eval dataset from real signals, not fabricated ones”

Draw test cases from production traces (with PII scrubbed per Pillar 10), supplement with synthetic inputs for edge cases when production data is scarce, and add every production bug as a permanent regression case. Small and high-signal beats large and noisy; curate, do not dump.

You apply privacy and access controls to production eval data

Section titled “You apply privacy and access controls to production eval data”

Production traces, user-feedback logs, and online-eval samples are a regulated dataset, not a free quality-improvement input. Scope them with: explicit retention windows (deletion after N days unless promoted to a curated regression case); redaction that covers secrets, customer identifiers, internal hostnames and IPs, and personal data (not just textbook PII categories); role-based access so engineers see only what their work requires; tenant isolation if you serve enterprise customers under contractual data-isolation requirements; a documented basis for collection (consent, contract, legitimate interest depending on jurisdiction); and a deletion path when data subjects exercise their rights or contracts terminate. Online evaluation requires the same privacy controls as any production data pipeline; see Pillar 10 for the broader data-hygiene framing.

Your agent configuration file (AGENTS.md, CLAUDE.md, .cursorrules, or equivalent) and project documentation directly shape every interaction. When AI output consistently misses the mark (wrong patterns, ignoring conventions, missing requirements), the first place to look is your context configuration. Test changes to your rules files the same way you would test code: make a change, run representative tasks, compare output quality.

You version and track your production prompts

Section titled “You version and track your production prompts”

Prompts drift. Models update. Requirements shift. If your prompts are not versioned alongside your code, you have no way to correlate a change in output quality with a change in your prompt, your model, or your data. Treat your prompt library as a first-class artifact in source control.

You evaluate in production, not just pre-deploy

Section titled “You evaluate in production, not just pre-deploy”

Pre-deploy evals catch the regressions you can predict; online evaluation catches distribution shift, new failure modes, provider model updates, and user behavior evolution. Instrument for trace sampling against your eval suite, user feedback signals (explicit and implicit), aggregate drift detection, and A/B rollouts of prompt changes.

You evaluate whether automated prompt optimization fits your eval maturity

Section titled “You evaluate whether automated prompt optimization fits your eval maturity”

Automated prompt and program optimization is a frontier practice for teams that already have a mature eval suite, a documented failure taxonomy from human-led error analysis, and the infrastructure to run optimization runs reproducibly. If those prerequisites are not in place, focus on Pillar 3 and the earlier expectations in this pillar first.

When the prerequisites are met, the eval-driven loop has four parts: generate output, measure it, diagnose what went wrong, modify the system. When the modification target is the prompt or compound program (not the model weights), automated optimization is production-viable. DSPy treats prompts as learnable parameters and uses Bayesian optimization (MIPROv2) to tune instructions and few-shot examples for multi-stage pipelines. GEPA is the current state of the art for compound systems: it samples system-level trajectories, reflects on them in natural language to diagnose problems, and combines complementary lessons from a Pareto frontier of attempts, reportedly outperforming reinforcement-learning baselines by 10-20% with up to 35x fewer rollouts.

A critical caveat from practice: automated optimization hill-climbs a predefined evaluation metric. It can refine a prompt to perform better on known failures but cannot discover new ones. Run human-led error analysis first to build a failure taxonomy, then automate the last mile. The Amazon Science survey maps the broader optimization landscape (APE, OPRO, ProTeGi, TextGrad, MIPROv2, GEPA) along consistent axes if you need to choose a method.

When a prompt or workflow produces poor results, diagnose why. Was it poor scoping? Bad context? A model update that changed behavior? Insufficient test coverage on edge cases? Each failure mode has a different fix.

Document what you learn and update your prompts, eval suite, or workflow accordingly. Production prompt failures are bugs. Treat them with the same urgency and rigor as any other production incident.

  • Deploying prompt changes to production without running evaluations
  • Evaluating output by "feel" without defined criteria for what good looks like
  • Applying a single evaluation technique (usually LLM-as-judge) to every problem regardless of fit
  • Treating LLM-as-judge output as ground truth without sampling and validating judge agreement against human labels
  • Scoring only quality while ignoring cost and latency regressions in the same change
  • Building an eval dataset from synthetic inputs alone, never from production traces
  • Letting the eval dataset go stale by never adding regression cases from real production bugs
  • No production observability: discovering a prompt is broken only when users complain
  • No regression testing when switching models or model versions
  • Treating prompts as static strings rather than versioned, testable code
  • Ignoring patterns in output failures instead of diagnosing root causes in the prompt
  • Not gating deployments on eval results the same way you gate on test results
  • Conflating ad-hoc development prompting (conversations with your coding assistant) with production prompt engineering (prompts that ship in your application)
  • Reaching for automated prompt optimization (DSPy, GEPA, etc.) before you've done error analysis; optimization can hill-climb a known metric but cannot discover new failure modes
  • Shankar et al., "Who Validates the Validators?" / EvalGen - Peer-reviewed academic study of eval criteria drift. Identifies the criteria-from-grading-not-a-priori phenomenon: users need criteria to grade outputs, but grading outputs is what helps users define criteria. Required reading for anyone building a static eval suite.
  • Eugene Yan: LLM-as-Judge Patterns and Best Practices - The canonical practitioner reference on LLM-as-judge. Covers position, length, and self-preference biases; rubric design; and judge-human agreement validation.
  • LLM-as-Judge Robustness - The limits and biases of model-graded evaluation. Required reading before building a judge-based pipeline.
  • Hamel Husain: LLM Evals - Everything You Need to Know - The 2026 comprehensive practitioner reference, consolidating the methodology Hamel teaches with Shreya Shankar in the AI Evals for Engineers and PMs course. Error-analysis-first, criteria-from-grading-not-a-priori, and the practical reasons automated optimization can only refine known failures.
  • Verbalized Sampling - Technique for generating diverse output variations for comparison.
  • Chatbot Arena - Canonical pairwise preference benchmark; useful as a methodology reference even when not running it.