
Online evaluation (production scoring)
/'aw.nleyen ih.va.lyoo'ay.shuhn pruh'duh.kshuhn 'skaw.rihng/

An eval performed on live production traffic, scoring real interactions asynchronously as they happen. Online evals catch regressions that only appear in real usage. (noun)
Why it matters
Offline evals tell you how your system performs on a fixed dataset, but production traffic is never static. Real queries surface phrasings, topics, and edge cases that your eval dataset doesn't cover. Online evals close this gap by running scorers on live traces as they arrive, giving you a continuous quality signal on actual usage. This catches problems that offline evals miss, like a new class of user queries that your system handles poorly, or a model behavior change from a provider update that only affects certain input patterns.

Online scoring also creates a natural pipeline for improving your offline datasets. When a production trace scores low, you can review it, validate the label, and add it to your eval suite so the same failure is caught before deployment next time. Without online evals, you are flying blind between releases, relying on user complaints to surface quality issues.
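For a concrete picture, here is a minimal sketch of what this can look like: an asynchronous job picks up newly logged traces, runs a scorer off the request path, records the score, and queues low scorers as candidates for the offline eval suite. Everything in it (the Trace type, trace_store, the answer_completeness scorer, review_queue, the sampling rate) is an illustrative stand-in under assumed names, not any particular platform's API.

```python
# Hypothetical sketch of an online-eval worker: score recently logged traces
# asynchronously and queue low scorers for review. The trace store, scorer,
# and review queue are illustrative stand-ins, not a real SDK.
import random
from dataclasses import dataclass, field


@dataclass
class Trace:
    id: str
    input: str
    output: str
    scores: dict[str, float] = field(default_factory=dict)


# Stand-ins for your observability platform's storage.
trace_store: list[Trace] = []
review_queue: list[Trace] = []  # candidates for the offline eval dataset

SAMPLE_RATE = 1.0          # lower this in production to control scoring cost
LOW_SCORE_THRESHOLD = 0.5  # below this, a trace becomes a review candidate


def answer_completeness(trace: Trace) -> float:
    """Toy heuristic scorer; in practice this is often an LLM-as-a-judge call."""
    return 1.0 if len(trace.output.split()) >= 10 else 0.0


def score_new_traces(unscored: list[Trace]) -> None:
    """Score live traces after the fact, never on the request path."""
    for trace in unscored:
        if random.random() > SAMPLE_RATE:
            continue  # skip unsampled traffic
        score = answer_completeness(trace)
        trace.scores["answer_completeness"] = score
        if score < LOW_SCORE_THRESHOLD:
            # Low-scoring production traces feed the offline eval suite
            # once a human has reviewed and validated the label.
            review_queue.append(trace)


if __name__ == "__main__":
    trace_store.append(
        Trace(id="t1", input="What is online eval?", output="Scoring live traffic.")
    )
    score_new_traces([t for t in trace_store if not t.scores])
    print(f"{len(review_queue)} trace(s) queued for review")
```

In a real deployment the scorer is usually an LLM-as-a-judge prompt or a task-specific heuristic, and sampling keeps scoring cost proportional to traffic rather than scoring every request.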
“The online eval caught a regression within an hour of deployment.”