Pillar 9: Evaluation and Measurement
Prompts that ship in your application are production code. Test them like production code.
When prompts become part of a product (system prompts, prompt templates, LLM-powered features, agent instructions), they carry the same weight as any other code in your codebase. They have inputs, outputs, and expected behaviors. They can regress. They can break when models update. They can behave differently across edge cases. And unlike a conversation with a coding assistant where you can course-correct in real time, production prompts run unsupervised at scale.
The discipline of evaluation applies the same rigor to these prompts that you already apply to your code: define expected behavior, write test cases, run them systematically, and gate deployments on the results. Without evals, you cannot answer basic questions: did that prompt change make things better or just different? Are we getting consistent results across user inputs, or is output quality a lottery?
Evals are also the quality gate for everything else in this repository. Context engineering (Pillar 1) is only as good as your ability to measure whether a rules file change actually improved output. Prompt engineering (Pillar 3) without measurement is just vibes. Guardrails (Pillar 5) catch syntax and style issues, but evals catch semantic issues: did the AI understand the requirement? Did it implement the right business logic?
What We Expect
You treat prompts in your application as testable software. Any prompt that ships as part of a product, pipeline, or automated workflow gets a test suite. Define input scenarios (including edge cases and adversarial inputs), define expected output characteristics, and run evaluations before deploying changes. Tools like promptfoo let you define test cases, run them against multiple prompt versions or models, and compare results systematically.
This is not about testing the prompts you use in conversation with your coding assistant. Those are iterative and disposable. This is about prompts that run in production, at scale, without a human in the loop to catch mistakes.
You pick the evaluation technique that matches the question. “Evaluation” is not one thing, and mature eval suites combine several methods:
- Deterministic assertions: exact match, regex, schema validation, contains / not-contains. Use for structured outputs and tool calls with known shape.
- LLM-as-judge: a second model grades output against a rubric. Useful for subjective qualities. Judges carry known biases (position, verbosity, self-preference) and degrade when grading needs external context; treat them as a filter, not ground truth. See Pillar 6.
- Pairwise / preference comparison: A vs. B, which is better. Best when “correct” is hard to define in absolute terms.
- Rubric scoring: numeric or categorical grades on defined dimensions, run by an LLM or human judge.
- Behavioral / trajectory evaluation: for agents, the path matters as much as the answer. Final-answer-only evaluation misses agents that got lucky or burned tokens on detours.
- Human evaluation on a golden dataset: the reference standard. Reserved for core cases where stakes justify the cost; use it to calibrate cheaper automated judges.
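Deterministic assertions, the first technique above, need no framework at all. A minimal Python sketch of the idea (the case shape and function names are illustrative, not any particular tool's API):

```python
import json
import re

def assert_contains(output: str, needle: str) -> bool:
    """Deterministic contains check."""
    return needle in output

def assert_matches(output: str, pattern: str) -> bool:
    """Regex assertion for outputs with a known shape."""
    return re.search(pattern, output) is not None

def assert_json_schema(output: str, required_keys: set) -> bool:
    """Minimal schema check: parses as a JSON object with the expected keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

# One test case: an input plus the assertions its output must satisfy.
case = {
    "input": "Summarize order #123 as JSON",
    "assertions": [
        lambda out: assert_json_schema(out, {"order_id", "summary"}),
        lambda out: assert_matches(out, r'"order_id":\s*123'),
    ],
}

model_output = '{"order_id": 123, "summary": "Two items, shipped."}'
print("PASS" if all(check(model_output) for check in case["assertions"]) else "FAIL")
```

Checks like these are cheap enough to run on every output; reserve judge-based techniques for the qualities they cannot capture.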
You integrate prompt evaluation into your CI/CD pipeline. Just like code changes trigger test suites, prompt changes should trigger evaluation runs. Set quality thresholds that gate deployment. Track regression across model updates. When a provider ships a new model version, your eval suite tells you whether your prompts still perform before you roll it out to users.
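The deployment gate itself can be a few lines run at the end of the eval job. A sketch, assuming a boolean pass/fail result per case and an illustrative 90% threshold:

```python
import sys

def gate(results: list, threshold: float = 0.9) -> int:
    """Return a CI exit code: 0 if the pass rate meets the threshold, 1 otherwise."""
    pass_rate = sum(results) / len(results)
    print(f"pass rate: {pass_rate:.1%} (threshold {threshold:.0%})")
    return 0 if pass_rate >= threshold else 1

# In CI, exiting nonzero fails the job and blocks the deploy:
exit_code = gate([True, True, True, True, False], threshold=0.9)  # 80% < 90%
# sys.exit(exit_code)
```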
You have criteria for “good output” before you evaluate: define what success looks like before you compare results. Common dimensions:
- Accuracy / correctness: does it give the right answer?
- Groundedness / faithfulness: for RAG systems, does the answer derive from retrieved context rather than parametric memory? (See Pillar 0 for RAG background.)
- Relevance: does it address the actual question asked?
- Format compliance: does it obey the schema or output constraints?
- Safety: no harmful, PII-leaking, or policy-violating output.
- Latency: track tail latency, not just the average. Tails dominate user experience.
- Cost: tokens per call. Quality gains that come with outsized cost increases are regressions, not wins. Given the spread across provider pricing, cost must be a tracked metric.
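Latency and cost can share the same eval report as quality. A sketch of a nearest-rank tail-latency metric and a budget check (the budget percentages are invented for illustration):

```python
import math

def p95(latencies_ms: list) -> float:
    """Nearest-rank 95th-percentile latency; tails dominate user experience."""
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def is_regression(old: dict, new: dict,
                  max_cost_increase: float = 0.10,
                  max_p95_increase: float = 0.20) -> bool:
    """A change regresses if quality drops, or if cost or tail latency blow
    their budget even when quality improves. Budgets here are illustrative."""
    if new["quality"] < old["quality"]:
        return True
    if new["cost_per_call"] > old["cost_per_call"] * (1 + max_cost_increase):
        return True
    return new["p95_ms"] > old["p95_ms"] * (1 + max_p95_increase)

old = {"quality": 0.80, "cost_per_call": 0.010, "p95_ms": 900}
new = {"quality": 0.90, "cost_per_call": 0.020, "p95_ms": 900}
print(is_regression(old, new))  # True: cost doubled, a regression despite better quality
```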
You build your eval dataset from real signals, not fabricated ones. Draw test cases from production traces (with PII scrubbed per Pillar 10), supplement with synthetic inputs for edge cases when production data is scarce, and add every production bug as a permanent regression case. Small and high-signal beats large and noisy; curate, do not dump.
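A dataset builder along these lines only needs a scrub step and a provenance tag. A sketch, with a deliberately naive regex scrub (real PII removal per Pillar 10 needs far more than two patterns; the ticket id is made up):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    """Naive PII masking for illustration only."""
    return PHONE.sub("<PHONE>", EMAIL.sub("<EMAIL>", text))

dataset: list = []

def add_case(trace_input: str, expected: str, source: str) -> None:
    """Add a scrubbed production trace, tagged with where it came from."""
    dataset.append({"input": scrub(trace_input), "expected": expected, "source": source})

# Every production bug becomes a permanent regression case.
add_case("Refund order for jane@example.com", "refund_flow", source="bug-4123")
```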
You evaluate rules files and project context systematically. Your agent configuration file (AGENTS.md, CLAUDE.md, .cursorrules, or equivalent) and project documentation directly shape every interaction. When AI output consistently misses the mark (wrong patterns, ignoring conventions, missing requirements), the first place to look is your context configuration. Test changes to your rules files the same way you would test code: make a change, run representative tasks, compare output quality.
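Comparing rules-file versions can reuse the pairwise technique: run the same representative tasks under the old and new configuration and tally which output a judge prefers. A sketch with stubbed runners and a stubbed judge (a real judge wraps an LLM with a rubric; everything named here is illustrative):

```python
from collections import Counter

def compare_versions(tasks, run_old, run_new, judge) -> Counter:
    """Tally judge preferences across representative tasks."""
    tally = Counter()
    for task in tasks:
        tally[judge(task, run_old(task), run_new(task))] += 1
    return tally

# Stubs for illustration; real runners invoke your agent with each rules file.
run_old = lambda t: f"old answer to {t}"
run_new = lambda t: f"new answer to {t}"
judge = lambda t, a, b: "new" if "new" in b else "tie"

print(compare_versions(["task-1", "task-2"], run_old, run_new, judge))
# Ship the rules-file change only on a clear preference margin.
```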
You version and track your production prompts. Prompts drift. Models update. Requirements shift. If your prompts are not versioned alongside your code, you have no way to correlate a change in output quality with a change in your prompt, your model, or your data. Treat your prompt library as a first-class artifact in source control.
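A content-addressed version id is one lightweight way to correlate output quality with the exact prompt text that produced it; a sketch:

```python
import hashlib

def prompt_version(template: str) -> str:
    """Content-addressed version id: it changes iff the prompt text does,
    so logs can be joined back to the exact template behind an output."""
    return hashlib.sha256(template.encode()).hexdigest()[:12]

TEMPLATE_V1 = "Summarize the following support ticket:\n{ticket}"
print(prompt_version(TEMPLATE_V1))  # log this id with every production call
```

Keeping the templates themselves in source control, with the id in your traces, lets you attribute a quality shift to the prompt, the model, or the data.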
You evaluate in production, not just pre-deploy. Pre-deploy evals catch the regressions you can predict; online evaluation catches distribution shift, new failure modes, provider model updates, and user behavior evolution. Instrument for trace sampling against your eval suite, user feedback signals (explicit and implicit), aggregate drift detection, and A/B rollouts of prompt changes.
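Online evaluation starts with deterministic trace sampling and a rolling pass rate for drift alerts. A sketch, with placeholder sampling rate, window size, and alert threshold:

```python
import zlib
from collections import deque

def sample_for_eval(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministic sampling: hash the trace id so the same trace gets the
    same decision everywhere, instead of flipping a coin per call."""
    return zlib.crc32(trace_id.encode()) % 10_000 < rate * 10_000

window = deque(maxlen=500)  # rolling window of online eval verdicts

def record(passed: bool, alert_below: float = 0.85) -> float:
    """Record one verdict; alert on aggregate drift once the window is full."""
    window.append(passed)
    rate = sum(window) / len(window)
    if len(window) == window.maxlen and rate < alert_below:
        print(f"ALERT: online pass rate {rate:.1%}")
    return rate

for trace_id in ["t-1", "t-2", "t-3"]:
    if sample_for_eval(trace_id, rate=1.0):  # rate=1.0 only for the demo
        record(passed=True)
```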
You close the feedback loop. When a prompt or workflow produces poor results, diagnose why. Was it poor scoping? Bad context? A model update that changed behavior? Insufficient test coverage on edge cases? Each failure mode has a different fix.
Document what you learn and update your prompts, eval suite, or workflow accordingly. Production prompt failures are bugs. Treat them with the same urgency and rigor as any other production incident.
Anti-patterns
- Deploying prompt changes to production without running evaluations
- Evaluating output by “feel” without defined criteria for what good looks like
- Applying a single evaluation technique (usually LLM-as-judge) to every problem regardless of fit
- Treating LLM-as-judge output as ground truth without sampling and validating judge agreement against human labels
- Scoring only quality while ignoring cost and latency regressions in the same change
- Building an eval dataset from synthetic inputs alone, never from production traces
- Letting the eval dataset go stale by never adding regression cases from real production bugs
- No production observability: discovering a prompt is broken only when users complain
- No regression testing when switching models or model versions
- Treating prompts as static strings rather than versioned, testable code
- Ignoring patterns in output failures instead of diagnosing root causes in the prompt
- Not gating deployments on eval results the same way you gate on test results
- Conflating ad-hoc development prompting (conversations with your coding assistant) with production prompt engineering (prompts that ship in your application)
Resources
Eval Tooling
- promptfoo - Open-source prompt evaluation framework with CI/CD integration, test cases, and multi-model comparison
- promptfoo CI/CD Integration - Guide for adding prompt evaluation to GitHub Actions, GitLab CI, Jenkins, and Azure Pipelines
- Arize AI: LLM Evaluations in CI/CD - Practical guide on dataset curation, LLM-as-Judge, and automation
Observability and Online Evaluation
- Langfuse - Open-source LLM observability and evaluation; traces, user feedback, prompt management
- Arize Phoenix - Open-source tracing and evaluation framework for LLM applications
- LangSmith - Observability, dataset curation, and eval platform from the LangChain team
- Helicone - Proxy-based observability with request caching and cost tracking
- Traceloop - OpenTelemetry-native tracing for LLM apps; pairs with any metrics backend
Judges, Rubrics, and RAG Evaluation
- LLM-as-Judge Robustness (2025) - The limits and biases of model-graded evaluation. Required reading before building a judge-based pipeline.
- RAGAS - Framework for RAG-specific dimensions: faithfulness, answer relevance, context precision
- Evaluating RAG Applications with RAGAS - Hands-on walkthrough applying the framework to a real RAG pipeline
- Chatbot Arena - Canonical pairwise preference benchmark; useful as a methodology reference even when not running it
Practices and Methodology
- Traceloop: Automated Prompt Regression Testing - Four-component framework for prompt versioning, test datasets, scoring, and deployment gates
- Anthropic Prompt Engineering Documentation - Covers evaluation techniques alongside prompting strategies
- Verbalized Sampling - Technique for generating diverse output variations for comparison
Related Pillars
- Pillar 3: Prompt Engineering - The prompting techniques that evaluation measures
- Pillar 5: Guardrails and Quality - Automated quality enforcement that complements eval
- Pillar 8: Continuous Evolution - The experimentation mindset that evaluation enables
- See Learning Paths for deeper dives