
LLM-as-a-judge

/el.el.em az uh juhj/ (noun)

A scoring approach where a language model evaluates the quality or correctness of another model's output, typically guided by a rubric. It enables scalable evals of open-ended outputs.

Why it matters

Many quality dimensions in AI, such as tone, helpfulness, and whether an answer actually addresses the question, are hard to capture with deterministic code. Human review is the gold standard, but it is too slow and expensive to run on every output. LLM-as-a-judge fills the gap by applying a rubric to model outputs at scale, letting you score thousands of examples in minutes rather than days.

The challenge is calibration. An LLM judge is only useful if it agrees with your human reviewers on the cases that matter. That means you need to validate it against a labeled set, iterate on the rubric prompt until agreement is high, and monitor for drift as your data distribution changes.

The best approach is often a layered one: code-based checks catch structural issues, LLM judges handle subjective dimensions, and human review covers the highest-stakes or most ambiguous cases. Reusable scorer definitions make it practical to run the same LLM judge in both offline evals and production scoring.
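As a concrete illustration, here is a minimal sketch of a rubric-guided judge in Python using the OpenAI SDK. This is not Braintrust's API; the rubric text, model name, and 1-to-5 scale are stand-in assumptions you would tune for your own task.

```python
# Minimal rubric-guided LLM judge (illustrative sketch, not Braintrust's API).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment; the
# rubric text, model name, and 1-5 scale are stand-in assumptions.
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are grading an AI assistant's answer.
Score 1-5 for how directly and helpfully it addresses the question.
Reply with only the integer score."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to score an answer against the rubric."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=0,  # keep grading as deterministic as possible
    )
    # Real scorers should parse defensively; models sometimes add extra text.
    return int(response.choices[0].message.content.strip())
```

Keeping the rubric and parsing in one scorer definition like this is what lets the same judge run unchanged in offline evals and production scoring.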

We use LLM-as-a-judge to score tone and correctness at scale.
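Calibrating a judge like the one above starts with a simple agreement check against a human-labeled set. A minimal sketch, where `judge_fn` is any scorer with the shape above and the (question, answer, human_score) tuples are hypothetical:

```python
# Simple calibration check (illustrative): measure how often the judge agrees
# with human reviewers on a labeled set before trusting it at scale. The
# (question, answer, human_score) tuple shape and `judge_fn` name are
# hypothetical; plug in a scorer like the sketch above.
from typing import Callable, Iterable, Tuple

def agreement_rate(
    labeled: Iterable[Tuple[str, str, int]],
    judge_fn: Callable[[str, str], int],
    tolerance: int = 0,
) -> float:
    """Fraction of examples where the judge lands within `tolerance`
    points of the human score."""
    examples = list(labeled)
    matches = sum(
        1
        for question, answer, human_score in examples
        if abs(judge_fn(question, answer) - human_score) <= tolerance
    )
    return matches / len(examples)
```

If agreement is low, revise the rubric prompt and re-run the check; re-check periodically to catch drift as the data distribution changes.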

Customer example

Dropbox uses online LLM-as-a-judge in production (alongside offline test suites) to score real queries against the same rubric and detect regressions in real time.

