
Experiment
/ih'kspeh.ruh.muhnt/ (noun)
An immutable snapshot of a single eval run, including the dataset, task configuration, scores, and outputs. Experiments make results reproducible and shareable.
Why it matters
When you change a prompt, swap a model, or refactor a pipeline step, you need to know whether the change helped or hurt. Experiments give you that answer by creating a snapshot of every output, score, and configuration detail from a single eval run. Because experiment results are immutable, you can always go back and compare exactly what changed between two runs side by side.

This reproducibility is what separates disciplined iteration from trial and error. If you ran an experiment last week with one prompt and another this week with a revised version, you can see precisely which test cases improved, which regressed, and by how much. Without experiment records, teams lose track of what they already tried, re-run the same variations, or ship changes without realizing they regressed on a subset of cases. Experiments also serve as documentation of your decision-making process, making it easy for teammates to understand why the current configuration was chosen.
“This experiment compares GPT-4.1 vs. Claude on the same dataset.”
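To make this concrete, here is a minimal sketch of what recording one experiment might look like with the Braintrust Python SDK. The project name, experiment name, toy task, and tiny dataset are illustrative placeholders, and the experiment_name keyword is an assumption about the Eval signature rather than a prescribed setup.

```python
# Minimal sketch: each call to Eval() records one experiment, i.e. an
# immutable snapshot of the dataset, task configuration, outputs, and scores
# for a single run.
from braintrust import Eval
from autoevals import Levenshtein


def greet(input):
    # Toy task standing in for a real model call or pipeline step.
    return "Hi " + input


Eval(
    "Say Hi Bot",                 # project the experiment is logged under
    experiment_name="prompt-v2",  # assumed keyword; names this run so it can be compared to "prompt-v1"
    data=lambda: [
        {"input": "Foo", "expected": "Hi Foo"},
        {"input": "Bar", "expected": "Hi Bar"},
    ],
    task=greet,
    scores=[Levenshtein],         # string-similarity scorer from autoevals
)
```

Re-running the same script with a revised prompt or a different model (and a new experiment name) produces a second, independent snapshot, which is what lets you diff the two runs case by case.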
Related Evaluation terms
- Absolute scoring
- Agent
- AI eval
- Alignment
- Annotation schema
- Baseline
- Baseline experiment
- Benchmark
- Calibration
- CI/CD integration
- Coherence
- Confidence interval
- Eval harness
- Eval leakage
- Factuality
- Failure mode
- Faithfulness
- Feedback signal
- Groundedness
- Hallucination
- Inter-annotator agreement (IAA)
- LLM-as-a-judge
- Loop
- Model comparison
- Multimodal
- Non-determinism
- Offline evaluation
- Pairwise evaluation
- Pass@k
- Playground
- Quality gate
- RAG (retrieval-augmented generation)
- RAG evaluation
- Reference-based scoring
- Reference-free scoring
- Regression testing
- Release criteria
- Remote evaluation
- Rubric
- Safety
- Score distribution
- Scorer
- Semantic failure
- Signal-to-noise ratio
- Task (eval task)
- Toxicity score