Non-determinism

/nahn.dih'ter.muh.nih.zuhm/ (noun)

The property of AI systems where the same input can produce different outputs across identical requests.

Why it matters

Non-determinism is the fundamental reason AI systems need evals rather than conventional pass/fail tests alone. When the same input can produce a different output on every run, a single passing test tells you almost nothing. You need to measure quality across distributions of outputs, track score trends over time, and build datasets large enough to capture the variance. This also changes how you think about regressions: a prompt change might improve average quality while making edge cases worse, and you won't detect that unless you run your eval suite across enough examples to see the distribution shift.
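To make the "single passing test" point concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `flaky_model` is a hypothetical stand-in for a non-deterministic model call, and `score` is a hypothetical exact-match scorer. It is not a real model or any platform's API. The sketch runs the same input many times and reports the score distribution instead of one result.

```python
import random
import statistics


def flaky_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a non-deterministic model call."""
    # The same prompt yields one of several answers, chosen at random.
    return rng.choice(["4", "4", "4", "four", "5"])


def score(output: str) -> float:
    """Hypothetical scorer: 1.0 for an exact match on the expected answer."""
    return 1.0 if output == "4" else 0.0


def evaluate(prompt: str, trials: int, seed: int) -> dict:
    """Run the same input repeatedly and summarize the score distribution."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    scores = [score(flaky_model(prompt, rng)) for _ in range(trials)]
    return {
        "trials": trials,
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
    }


summary = evaluate("What is 2 + 2?", trials=200, seed=0)
print(summary)
```

A single call to `flaky_model` could pass or fail by chance. The mean over many trials is the quantity worth tracking over time.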

Because of non-determinism, we evaluate changes on distributions of scores, not one-off outputs.
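The regression case can be sketched the same way. The score lists below are made-up numbers for illustration only, not measured results, and the 0.5 failure threshold is an arbitrary assumption. The sketch compares two prompt versions on the mean and on the tail of the distribution.

```python
import statistics

# Made-up per-example scores in [0, 1], for illustration only.
baseline = [0.70, 0.72, 0.68, 0.71, 0.69, 0.70, 0.73, 0.67, 0.70, 0.70]
candidate = [0.95, 0.93, 0.96, 0.94, 0.95, 0.92, 0.96, 0.94, 0.20, 0.15]

FAIL_THRESHOLD = 0.5  # arbitrary cutoff assumed for this sketch


def summarize(scores: list[float]) -> dict:
    """Summarize a score distribution by its mean and its tail."""
    return {
        "mean": statistics.mean(scores),
        "worst": min(scores),
        "failures": sum(s < FAIL_THRESHOLD for s in scores),
    }


base = summarize(baseline)
cand = summarize(candidate)

print("baseline: ", base)
print("candidate:", cand)

# The candidate wins on average while losing on the tail.
mean_improved = cand["mean"] > base["mean"]
tail_regressed = cand["worst"] < base["worst"]
print("mean improved:", mean_improved, "| tail regressed:", tail_regressed)
```

In this made-up data the candidate has the higher mean but also the lowest individual scores, which is the kind of shift that a mean-only comparison or a handful of spot checks would miss.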

Customer example

Notion embraced non-determinism by moving ~70 engineers beyond "vibe checks" to systematic evals from feedback and traces, so the team can ship quickly even as agent behaviors and outcomes vary across runs.

