
Scorer

/'skaw.rer/ (noun)

A function that measures the quality of a task's output. Scorers can be deterministic code checks, statistical measures, or LLM-based judges.

Why it matters

Scorers are how you turn subjective quality judgments into repeatable measurements. They range from simple code checks (did the output contain valid JSON?) to calibrated LLM judges (is this answer helpful and grounded?) to human review queues for the highest-stakes cases. The real value of a scorer is its reusability. When you define a scorer once and run it in both offline experiments and production scoring, you get a consistent quality signal across your entire development lifecycle. A groundedness scorer that catches hallucinations in your eval suite can also flag them in live traffic without any additional work. Teams that invest in a shared library of well-calibrated scorers iterate faster because every prompt change, model swap, or pipeline refactor gets evaluated against the same bar automatically.
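At the simple end, a deterministic scorer can be a plain function that returns 1.0 when the output parses as valid JSON and 0.0 otherwise. The sketch below is a minimal, framework-agnostic illustration; the exact signature a given eval platform expects (argument names, return types) may differ.

    import json

    def valid_json_scorer(output: str) -> float:
        """Deterministic code check: 1.0 if the output is valid JSON, else 0.0."""
        try:
            json.loads(output)
            return 1.0
        except (json.JSONDecodeError, TypeError):
            return 0.0

    # The same function can score an offline eval case...
    assert valid_json_scorer('{"answer": 42}') == 1.0
    # ...and a sampled production response, with no extra wiring.
    assert valid_json_scorer("Sorry, I can't help with that.") == 0.0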

We wrote a scorer that checks whether the answer includes at least one valid citation.
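Assuming citations appear as bracketed numbers such as [1], that scorer could be a hypothetical sketch as small as this:

    import re

    def has_citation_scorer(output: str) -> float:
        """Return 1.0 if the answer contains at least one bracketed citation like [1], else 0.0."""
        return 1.0 if re.search(r"\[\d+\]", output) else 0.0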

Customer example

Dropbox evolved from generic similarity metrics to LLM-as-a-judge scorers with versioned rubrics, which it now runs across both offline suites and sampled production traffic.


From the docs

Get started with Evals

Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.
