
Online evaluation (production scoring)
/'aw.nleyen ih.va.lyoo'ay.shuhn pruh'duh.kshuhn 'skaw.rihng/

An eval performed on live production traffic, scoring real interactions asynchronously as they happen. Online evals catch regressions that only appear in real usage. (noun)
Why it matters
Offline evals tell you how your system performs on a fixed dataset, but production traffic is never static. Real queries surface phrasings, topics, and edge cases that your eval dataset doesn't cover. Online evals close this gap by running scorers on live traces as they arrive, giving you a continuous quality signal on actual usage. This catches problems that offline evals miss, like a new class of user queries that your system handles poorly, or a model behavior change from a provider update that only affects certain input patterns.

Online scoring also creates a natural pipeline for improving your offline datasets. When a production trace scores low, you can review it, validate the label, and add it to your eval suite so the same failure is caught before deployment next time. Without online evals, you are flying blind between releases, relying on user complaints to surface quality issues.
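For a concrete picture, here is a minimal sketch of what this can look like: an asynchronous job picks up newly logged traces, runs a scorer off the request path, records the score, and queues low scorers as candidates for the offline eval suite. Everything in it (the Trace type, trace_store, the answer_completeness scorer, review_queue, the sampling rate) is an illustrative stand-in under assumed names, not any particular platform's API.

```python
# Hypothetical sketch of an online-eval worker: score recently logged traces
# asynchronously and queue low scorers for review. The trace store, scorer,
# and review queue are illustrative stand-ins, not a real SDK.
import random
from dataclasses import dataclass, field


@dataclass
class Trace:
    id: str
    input: str
    output: str
    scores: dict[str, float] = field(default_factory=dict)


# Stand-ins for your observability platform's storage.
trace_store: list[Trace] = []
review_queue: list[Trace] = []  # candidates for the offline eval dataset

SAMPLE_RATE = 1.0          # lower this in production to control scoring cost
LOW_SCORE_THRESHOLD = 0.5  # below this, a trace becomes a review candidate


def answer_completeness(trace: Trace) -> float:
    """Toy heuristic scorer; in practice this is often an LLM-as-a-judge call."""
    return 1.0 if len(trace.output.split()) >= 10 else 0.0


def score_new_traces(unscored: list[Trace]) -> None:
    """Score live traces after the fact, never on the request path."""
    for trace in unscored:
        if random.random() > SAMPLE_RATE:
            continue  # skip unsampled traffic
        score = answer_completeness(trace)
        trace.scores["answer_completeness"] = score
        if score < LOW_SCORE_THRESHOLD:
            # Low-scoring production traces feed the offline eval suite
            # once a human has reviewed and validated the label.
            review_queue.append(trace)


if __name__ == "__main__":
    trace_store.append(
        Trace(id="t1", input="What is online eval?", output="Scoring live traffic.")
    )
    score_new_traces([t for t in trace_store if not t.scores])
    print(f"{len(review_queue)} trace(s) queued for review")
```

In a real deployment the scorer is usually an LLM-as-a-judge prompt or a task-specific heuristic, and sampling keeps scoring cost proportional to traffic rather than scoring every request.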
“The online eval caught a regression within an hour of deployment.”