Experiment

/ih'kspeh.ruh.muhnt/ (noun)

An immutable snapshot of a single eval run, including the dataset, task configuration, scores, and outputs. Experiments make results reproducible and shareable.

Why it matters

When you change a prompt, swap a model, or refactor a pipeline step, you need to know whether the change helped or hurt. Experiments give you that answer by snapshotting every output, score, and configuration detail from a single eval run. Because experiment results are immutable, you can always go back and compare exactly what changed between two runs, side by side. This reproducibility is what separates disciplined iteration from trial and error. If you ran an experiment last week with one prompt and another this week with a revised version, you can see precisely which test cases improved, which regressed, and by how much.

Without experiment records, teams lose track of what they have already tried, re-run the same variations, or ship changes without realizing they regressed on a subset of cases. Experiments also document your decision-making process, making it easy for teammates to understand why the current configuration was chosen.
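To make this concrete, here is a minimal sketch of recording one eval run as an experiment with the Braintrust Python SDK. The project name, dataset, and task are invented for illustration; the `Eval` entry point and the `Levenshtein` scorer from `autoevals` follow the SDK's basic pattern, but treat the exact shapes as an assumption rather than a definitive recipe.

```python
# Minimal sketch: one eval run recorded as an immutable experiment.
# Assumes `pip install braintrust autoevals` and a configured API key.
from braintrust import Eval
from autoevals import Levenshtein

Eval(
    "greeting-bot",  # project name (hypothetical)
    data=lambda: [
        {"input": "Alice", "expected": "Hi Alice"},
        {"input": "Bob", "expected": "Hi Bob"},
    ],
    task=lambda input: "Hi " + input,  # the pipeline step under test
    scores=[Levenshtein],  # string similarity between output and expected
)
# Each run creates a new experiment: inputs, outputs, scores, and
# configuration are snapshotted so later runs can be diffed against it.
```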

For example, one experiment might compare GPT-4.1 against Claude on the same dataset.
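As a sketch of what that comparison could look like in code, the loop below runs one experiment per model over a shared dataset. `call_model` is a hypothetical stand-in for real model client calls, the model identifiers and dataset are illustrative, and the `experiment_name` parameter is an assumption about how each run gets labeled.

```python
# Hedged sketch: two experiments over the same dataset, one per model,
# so results can be compared case by case afterwards.
from braintrust import Eval
from autoevals import Factuality

DATASET = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "Who wrote Dune?", "expected": "Frank Herbert"},
]

def call_model(model: str, prompt: str) -> str:
    """Hypothetical helper standing in for your actual LLM client call."""
    raise NotImplementedError

def make_task(model: str):
    """Bind the model name so each experiment runs the same task shape."""
    def task(input):
        return call_model(model, input)
    return task

for model in ["gpt-4.1", "claude-3-5-sonnet"]:
    Eval(
        "qa-bot",  # same project, so the experiments sit side by side
        experiment_name=f"qa-bot-{model}",  # assumption: labels the snapshot
        data=lambda: DATASET,
        task=make_task(model),
        scores=[Factuality],  # LLM-judged factual agreement with expected
    )
```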

