Encyclopedia Evalica / Evaluation / Scorer
Scorer
/ˈskɔː.rər/ (noun) A function that measures the quality of a task's output. Scorers can be deterministic code checks, statistical measures, or LLM-based judges.
Why it matters
Scorers are how you turn subjective quality judgments into repeatable measurements. They range from simple code checks (did the output contain valid JSON?) to calibrated LLM judges (is this answer helpful and grounded?) to human review queues for the highest-stakes cases.

The real value in a scorer is its reusability. When you define a scorer once and run it in both offline experiments and production scoring, you get a consistent quality signal across your entire development lifecycle. A groundedness scorer that catches hallucinations in your eval suite can also flag them in live traffic without any additional work.

Teams that invest in a shared library of well-calibrated scorers iterate faster because every prompt change, model swap, or pipeline refactor gets evaluated against the same bar automatically.
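As a minimal sketch of the "simple code check" end of that range, here is a deterministic scorer that returns 1.0 when an output parses as valid JSON and 0.0 otherwise. The function name and the 0-to-1 score convention are illustrative assumptions, not a specific framework's API.

```python
import json


def valid_json_scorer(output: str) -> float:
    """Deterministic code-check scorer (hypothetical example).

    Returns 1.0 if the model output parses as JSON, else 0.0.
    """
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```

Because the check is pure code with no model call, the same function can score offline experiments and sampled production traffic at negligible cost.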
“We wrote a scorer that checks whether the answer includes at least one valid citation.”
Customer example
Dropbox evolved from generic similarity metrics to LLM-as-a-judge scorers with versioned rubrics, and now runs them across both offline suites and sampled production traffic. Read more
Related Evaluation terms
- Absolute scoring
- Agent
- AI eval
- Alignment
- Annotation schema
- Baseline
- Baseline experiment
- Benchmark
- Calibration
- CI/CD integration
- Coherence
- Confidence interval
- Eval harness
- Eval leakage
- Experiment
- Factuality
- Failure mode
- Faithfulness
- Feedback signal
- Groundedness
- Hallucination
- Inter-annotator agreement (IAA)
- LLM-as-a-judge
- Loop
- Model comparison
- Multimodal
- Non-determinism
- Offline evaluation
- Pairwise evaluation
- Pass@k
- Playground
- Quality gate
- RAG (retrieval-augmented generation)
- RAG evaluation
- Reference-based scoring
- Reference-free scoring
- Regression testing
- Release criteria
- Remote evaluation
- Rubric
- Safety
- Score distribution
- Semantic failure
- Signal-to-noise ratio
- Task (eval task)
- Toxicity score
From the docs
Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.
Start building