Regression testing
/ruh'greh.shuhn 'teh.stihng/ (noun)
Re-running a fixed eval suite after every change to detect score decreases before they reach production. Regression testing is the eval equivalent of unit tests.
Why it matters
In traditional software, regressions show up as failing tests or crashes. In AI systems, regressions are subtler. A prompt change might improve average quality while degrading performance on a specific category of inputs. A model swap might change tone in ways that do not trigger any errors but frustrate your developers and end users. These regressions are hard to catch without a fixed eval suite that you run after every change.

Regression testing for AI means maintaining a stable dataset and set of scorers, running them automatically in CI or before deployment, and comparing results against a known baseline. The key difference from traditional regression testing is that you often need to look at score distributions rather than binary pass/fail results, because small shifts in average quality can mask larger regressions on specific subsets. Setting explicit thresholds and tracking per-category performance helps you catch these subtle degradations before they compound into a visible quality problem.
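For illustration, here is a minimal sketch of a per-category regression gate that could run in CI after an eval suite completes. It assumes baseline and candidate results are already available as per-category score lists; the category names, scores, and threshold are hypothetical placeholders, not the output of any particular eval platform or API.

```python
# Minimal regression gate sketch. BASELINE, CANDIDATE, the category names,
# and MAX_DROP are illustrative placeholders, not real eval output.
from statistics import mean

BASELINE = {
    "groundedness": [0.92, 0.88, 0.95, 0.90],
    "tone": [0.81, 0.79, 0.84, 0.80],
}

CANDIDATE = {
    "groundedness": [0.85, 0.82, 0.88, 0.84],  # regressed on this subset
    "tone": [0.83, 0.85, 0.82, 0.86],
}

MAX_DROP = 0.03  # fail if any category's mean score falls by more than this


def check_regressions(baseline, candidate, max_drop=MAX_DROP):
    """Compare per-category mean scores and collect categories that regressed."""
    failures = []
    for category, base_scores in baseline.items():
        drop = mean(base_scores) - mean(candidate[category])
        if drop > max_drop:
            failures.append(f"{category}: mean dropped by {drop:.3f}")
    return failures


if __name__ == "__main__":
    failures = check_regressions(BASELINE, CANDIDATE)
    if failures:
        # A nonzero exit status is what lets CI block the release.
        raise SystemExit("Regression detected:\n" + "\n".join(failures))
    print("No regressions above threshold.")
```

Comparing means per category rather than one overall average is what catches the subset regressions described above; a stricter version might compare full score distributions or confidence intervals instead of means.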
“Regression testing blocked the release when groundedness dipped below the threshold.”
Customer example
Navan runs regression testing on production voice calls continuously, evaluating changes against established baselines to prevent quality drops before they reach travelers.
Related Evaluation terms
- Absolute scoring
- Agent
- AI eval
- Alignment
- Annotation schema
- Baseline
- Baseline experiment
- Benchmark
- Calibration
- CI/CD integration
- Coherence
- Confidence interval
- Eval harness
- Eval leakage
- Experiment
- Factuality
- Failure mode
- Faithfulness
- Feedback signal
- Groundedness
- Hallucination
- Inter-annotator agreement (IAA)
- LLM-as-a-judge
- Loop
- Model comparison
- Multimodal
- Non-determinism
- Offline evaluation
- Pairwise evaluation
- Pass@k
- Playground
- Quality gate
- RAG (retrieval-augmented generation)
- RAG evaluation
- Reference-based scoring
- Reference-free scoring
- Release criteria
- Remote evaluation
- Rubric
- Safety
- Score distribution
- Scorer
- Semantic failure
- Signal-to-noise ratio
- Task (eval task)
- Toxicity score