
AI eval
/ay.eye ee.val/ (noun)
A structured test that measures whether an AI system is producing good outputs. Evals pass a dataset of inputs through a task function, score the outputs, and produce a result you can compare against previous runs.
Why it matters
Early in development, you can check AI output quality by reading a few examples and deciding whether they look right. That approach stops working as soon as you have multiple prompts, models, or use cases. Systematic evals replace gut-feel judgment with repeatable measurement: you define a dataset of inputs and expected behaviors, run your system against it, and score the outputs with deterministic checks or calibrated LLM judges. The result is a number you can compare across runs, so you can answer questions like whether a prompt change actually helped or whether a new model regressed on edge cases.

Without evals, every change is a gamble because you have no way to know what you broke until a developer or end user reports it. With evals in CI or as part of your deployment process, you catch regressions before they reach production and build confidence that your system is improving over time.
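In code, that measurement loop is small. The sketch below is a minimal illustration of the pattern, not any particular framework's API: the dataset, task, exact_match scorer, and BASELINE are all hypothetical names, and the task is stubbed with canned answers so the example runs on its own. In practice, the task would call your prompt and model, and the baseline would come from your last accepted run.

```python
# Minimal eval loop: dataset in, task function, scorer, comparable result.
# All names here are illustrative, not a specific framework's API.

def task(input_text: str) -> str:
    """The system under test: normally a prompt + model call, stubbed here."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(input_text, "")

def exact_match(output: str, expected: str) -> float:
    """A deterministic scorer; a calibrated LLM judge could return a float here instead."""
    return 1.0 if output.strip() == expected.strip() else 0.0

# The eval dataset: inputs paired with expected behavior.
dataset = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

# Run every input through the task and score each output.
scores = [exact_match(task(row["input"]), row["expected"]) for row in dataset]
result = sum(scores) / len(scores)
print(f"score: {result:.2f} on {len(dataset)} cases")

# Compare against the previous run; failing here in CI blocks the regression.
BASELINE = 0.90  # assumed score from the last accepted run
if result < BASELINE:
    raise SystemExit(f"regression: {result:.2f} < baseline {BASELINE:.2f}")
```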
“Before shipping, we ran an eval to make sure the new tool-calling logic didn't regress.”
Customer example
Dropbox built a multi-tier eval pipeline for its AI search: 150 smoke tests pre-merge, 10,000+ evals post-merge, and online LLM-as-a-judge scoring in production. They use thumbs up/down feedback and traces to continuously expand their eval datasets and catch regressions in real time.
Related Evaluation terms
- Absolute scoring
- Agent
- Alignment
- Annotation schema
- Baseline
- Baseline experiment
- Benchmark
- Calibration
- CI/CD integration
- Coherence
- Confidence interval
- Eval harness
- Eval leakage
- Experiment
- Factuality
- Failure mode
- Faithfulness
- Feedback signal
- Groundedness
- Hallucination
- Inter-annotator agreement (IAA)
- LLM-as-a-judge
- Loop
- Model comparison
- Multimodal
- Non-determinism
- Offline evaluation
- Pairwise evaluation
- Pass@k
- Playground
- Quality gate
- RAG (retrieval-augmented generation)
- RAG evaluation
- Reference-based scoring
- Reference-free scoring
- Regression testing
- Release criteria
- Remote evaluation
- Rubric
- Safety
- Score distribution
- Scorer
- Semantic failure
- Signal-to-noise ratio
- Task (eval task)
- Toxicity score
From the docs
Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.