
AI eval
/ay.eye ee.val/ (noun)
A structured test that measures whether an AI system is producing good outputs. Evals pass a dataset of inputs through a task function, score the outputs, and produce a result you can compare against previous runs.
Why it matters
Early in development, you can check AI output quality by reading a few examples and deciding whether they look right. That approach stops working as soon as you have multiple prompts, models, or use cases. Systematic evals replace gut-feel judgment with repeatable measurement: you define a dataset of inputs and expected behaviors, run your system against it, and score the outputs with deterministic checks or calibrated LLM judges. The result is a number you can compare across runs, so you can answer questions like whether a prompt change actually helped or whether a new model regressed on edge cases.

Without evals, every change is a gamble because you have no way to know what you broke until a developer or end user reports it. With evals in CI or as part of your deployment process, you catch regressions before they reach production and build confidence that your system is improving over time.
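In code, that measurement loop is small. The sketch below is a minimal illustration of the pattern, not any particular framework's API: the dataset, task, exact_match scorer, and BASELINE are all hypothetical names, and the task is stubbed with canned answers so the example runs on its own. In practice, the task would call your prompt and model, and the baseline would come from your last accepted run.

```python
# Minimal eval loop: dataset in, task function, scorer, comparable result.
# All names here are illustrative, not a specific framework's API.

def task(input_text: str) -> str:
    """The system under test: normally a prompt + model call, stubbed here."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(input_text, "")

def exact_match(output: str, expected: str) -> float:
    """A deterministic scorer; a calibrated LLM judge could return a float here instead."""
    return 1.0 if output.strip() == expected.strip() else 0.0

# The eval dataset: inputs paired with expected behavior.
dataset = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

# Run every input through the task and score each output.
scores = [exact_match(task(row["input"]), row["expected"]) for row in dataset]
result = sum(scores) / len(scores)
print(f"score: {result:.2f} on {len(dataset)} cases")

# Compare against the previous run; failing here in CI blocks the regression.
BASELINE = 0.90  # assumed score from the last accepted run
if result < BASELINE:
    raise SystemExit(f"regression: {result:.2f} < baseline {BASELINE:.2f}")
```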
“Before shipping, we ran an eval to make sure the new tool-calling logic didn't regress.”
Customer example
Dropbox built a multi-tier eval pipeline for its AI search: 150 smoke tests pre-merge, 10,000+ evals post-merge, and online LLM-as-a-judge scoring in production. They use thumbs up/down feedback and traces to continuously expand their eval datasets and catch regressions in real time.
Related Evaluation terms
- Absolute scoring
- Agent
- Alignment
- Annotation schema
- Baseline
- Baseline experiment
- Benchmark
- Calibration
- CI/CD integration
- Coherence
- Confidence interval
- Eval harness
- Eval leakage
- Experiment
- Factuality
- Failure mode
- Faithfulness
- Feedback signal
- Groundedness
- Hallucination
- Inter-annotator agreement (IAA)
- LLM-as-a-judge
- Loop
- Model comparison
- Multimodal
- Non-determinism
- Offline evaluation
- Pairwise evaluation
- Pass@k
- Playground
- Quality gate
- RAG (retrieval-augmented generation)
- RAG evaluation
- Reference-based scoring
- Reference-free scoring
- Regression testing
- Release criteria
- Remote evaluation
- Rubric
- Safety
- Score distribution
- Scorer
- Semantic failure
- Signal-to-noise ratio
- Task (eval task)
- Toxicity score
From the docs
Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.