Semantic failure
/suh'man.tihk 'fay.lyer/ (noun)
When an AI system is operationally healthy (low latency, no errors) but produces outputs that are factually wrong, off-brand, or harmful. Semantic failures are invisible to infrastructure monitoring and only detectable through evals.
Why it matters
Semantic failures are the defining challenge of AI quality. Your infrastructure monitoring will show green dashboards while your AI gives wrong answers, because a 200 OK response with a hallucinated answer looks identical to a correct one at the HTTP level. The only way to catch it is with evals, whether that means scoring production traces, running regression suites, or having human reviewers flag bad outputs. Teams that rely solely on traditional monitoring will miss the majority of their AI quality issues.
“Latency was great, but the assistant kept giving wrong refund rules, so it was actually a semantic failure.”
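The gap described above can be sketched in a few lines of code. This is a hypothetical illustration, not a real Braintrust API: `handle_request`, `operational_check`, and `semantic_eval` are made-up names, and the "scorer" is a trivial substring match standing in for a real reference-based or LLM-as-a-judge eval.

```python
# Hypothetical sketch: why operational monitoring misses semantic failures.
# All function names and data here are illustrative, not a real API.

def handle_request(question: str) -> dict:
    """Simulated AI endpoint: fast, returns 200, but the answer is wrong."""
    return {
        "status": 200,
        "latency_ms": 42,
        "answer": "Refunds are available for 90 days.",  # hallucinated policy
    }

def operational_check(response: dict) -> bool:
    """What infrastructure monitoring sees: status code and latency only."""
    return response["status"] == 200 and response["latency_ms"] < 500

def semantic_eval(response: dict, reference: str) -> bool:
    """A toy reference-based scorer; real evals might use an LLM judge."""
    return reference.lower() in response["answer"].lower()

response = handle_request("What is the refund window?")
print(operational_check(response))         # True  -> dashboard stays green
print(semantic_eval(response, "30 days"))  # False -> semantic failure
```

The point of the sketch: both checks run against the same response, but only the semantic eval compares the output to what the answer should be, so only it can catch the hallucinated refund policy.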
Customer example
Retool discovered a semantic failure where an agent confidently claimed it completed a task when it hadn't. By analyzing traces with Loop, they found the root cause (missing tool definitions) and fixed the underlying system issue.
Related Evaluation terms
- Absolute scoring
- Agent
- AI eval
- Alignment
- Annotation schema
- Baseline
- Baseline experiment
- Benchmark
- Calibration
- CI/CD integration
- Coherence
- Confidence interval
- Eval harness
- Eval leakage
- Experiment
- Factuality
- Failure mode
- Faithfulness
- Feedback signal
- Groundedness
- Hallucination
- Inter-annotator agreement (IAA)
- LLM-as-a-judge
- Loop
- Model comparison
- Multimodal
- Non-determinism
- Offline evaluation
- Pairwise evaluation
- Pass@k
- Playground
- Quality gate
- RAG (retrieval-augmented generation)
- RAG evaluation
- Reference-based scoring
- Reference-free scoring
- Regression testing
- Release criteria
- Remote evaluation
- Rubric
- Safety
- Score distribution
- Scorer
- Signal-to-noise ratio
- Task (eval task)
- Toxicity score
From the docs
Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.