Scorers evaluate AI output quality by assigning scores between 0 and 1 based on criteria you define, like factual accuracy, helpfulness, or correct formatting.
Scorer types
Braintrust offers three types of scorers:
- Autoevals: Pre-built, battle-tested scorers for common evaluation tasks like factuality checking, semantic similarity, and format validation. Best for standard evaluation needs where reliable scorers already exist.
- LLM-as-a-judge: Use language models to evaluate outputs based on natural language criteria and instructions. Best for subjective judgments like tone, helpfulness, or creativity that are difficult to encode in deterministic code.
- Custom code: Write custom evaluation logic with full control over the scoring algorithm. Best for specific business rules, pattern matching, or calculations unique to your use case.
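For example, here is a minimal sketch of running a pre-built autoeval directly, assuming the autoevals Python package is installed and an LLM key is configured:

```python
from autoevals import Factuality

# Factuality is an LLM-as-a-judge autoeval that compares the output
# to the expected answer for factual consistency.
evaluator = Factuality()
result = evaluator(
    input="Which country has the highest population?",
    output="People's Republic of China",
    expected="China",
)
print(result.score)  # a value between 0 and 1
```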
Where to define scorers
You can define scorers in three places:
- Inline in SDK code: Define scorers directly in your evaluation scripts. Best for local development, access to complex dependencies, or application-specific logic tightly coupled to your codebase.
- Pushed via CLI: Define TypeScript or Python scorers in code files and push them to Braintrust. Best for version control in Git, team-wide sharing across projects, and automatic evaluation of production logs.
- Created in UI: Build TypeScript or Python scorers in the Braintrust web interface. Best for rapid prototyping and simple LLM-as-a-judge scorers.
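As a sketch of the inline approach, a custom scorer can be an ordinary function passed to Eval (the project name and scorer here are illustrative):

```python
from braintrust import Eval

def exact_match(input, expected, output):
    # Deterministic custom-code scorer: full credit only for an exact match.
    return 1 if output == expected else 0

Eval(
    "Say Hi Bot",  # illustrative project name
    data=lambda: [{"input": "Foo", "expected": "Hi Foo"}],
    task=lambda input: "Hi " + input,
    scores=[exact_match],
)
```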
Test scorers
Scorers need to be developed iteratively against real data. When creating or editing a scorer in the UI, use the Run section to test your scorer with data from different sources. Each variable source populates the scorer’s input parameters (like input, output, expected, and metadata) from a different location.
Test with manual input
Best for initial development when you have a specific example in mind. Use this to quickly prototype and verify basic scorer logic before testing on larger datasets.
- Select Editor in the Run section.
- Enter values for the input, output, expected, and metadata fields.
- Click Test to see how your scorer evaluates the example.
- Iterate on your scorer logic based on the results.
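If you are prototyping a code scorer in the editor, a minimal handler might look like the following. This is a sketch, assuming the UI passes input, output, expected, and metadata as named parameters to a handler function:

```python
def handler(input, output, expected, metadata):
    # Illustrative logic: full credit for an exact match,
    # partial credit if the expected answer appears in the output.
    if output == expected:
        return 1
    if expected and expected in output:
        return 0.5
    return 0
```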
Test with a dataset
Best for testing specific scenarios, edge cases, or regression testing. Use this when you want controlled, repeatable test cases or need to ensure your scorer handles specific situations correctly.
- Select Dataset in the Run section.
- Choose a dataset from your project.
- Select a record to test with.
- Click Test to see how your scorer evaluates the example.
- Review results to identify patterns and edge cases.
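To seed such a dataset from code, a sketch using the SDK might look like this (the project and dataset names are illustrative):

```python
from braintrust import init_dataset

dataset = init_dataset(project="My Project", name="Scorer test cases")

# Insert known edge cases you want the scorer to handle correctly.
dataset.insert(
    input="What is 2 + 2?",
    expected="4",
    metadata={"category": "arithmetic"},
)
dataset.flush()
```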
Test with logs
Best for testing against actual usage patterns and debugging real-world edge cases. Use this when you want to see how your scorer performs on data your system is actually generating.
- Select Logs in the Run section.
- Select the project containing the logs you want to test against.
- Filter logs to find relevant examples:
- Click Add filter and choose just root spans, specific span names, or a more advanced filter based on specific input, output, metadata, or other values.
- Select a timeframe.
- Click Test to see how your scorer evaluates real production data.
- Identify cases where the scorer needs adjustment for real-world scenarios.
Scorer permissions
Both LLM-as-a-judge and custom code scorers automatically receive a BRAINTRUST_API_KEY environment variable that allows them to:
- Make LLM calls using organization and project AI secrets
- Access attachments from the current project
- Read and write logs to the current project
- Read prompts from the organization
You can manage custom environment variables for scorers through the PUT /v1/env_var endpoint.
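As a sketch of what this enables, a custom code scorer can use the injected key to call a model through the Braintrust AI proxy, assuming the openai package is available in the scorer runtime:

```python
import os

from openai import OpenAI

# BRAINTRUST_API_KEY is injected automatically into the scorer runtime.
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["BRAINTRUST_API_KEY"],
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # routed through your configured AI secrets
    messages=[{"role": "user", "content": "Rate this answer from 0 to 1: ..."}],
)
```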
Optimize with Loop
Generate and improve scorers using Loop. Example queries:
- “Write an LLM-as-a-judge scorer for a chatbot that answers product questions”
- “Generate a code-based scorer based on project logs”
- “Optimize the Helpfulness scorer”
- “Adjust the scorer to be more lenient”
Best practices
- Start with autoevals: Use pre-built scorers when they fit your needs. They’re well-tested and reliable.
- Be specific: Define clear evaluation criteria in your scorer prompts or code.
- Use multiple scorers: Measure different aspects (factuality, helpfulness, tone) with separate scorers, as in the sketch below.
- Choose the right scope: Use trace scorers for multi-step workflows and agents. Use span scorers for simple quality checks.
- Test scorers: Run scorers on known examples to verify they behave as expected.
- Version scorers: Like prompts, scorers are versioned automatically. Track what works.
- Balance cost and quality: LLM-as-a-judge scorers are more flexible but cost more and take longer than custom code scorers.
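For instance, a single evaluation can combine a pre-built autoeval with a custom code scorer, each measuring a different aspect. A sketch, with illustrative names:

```python
from autoevals import Factuality
from braintrust import Eval

def concise(input, expected, output):
    # Custom-code scorer for formatting: reward answers under 50 words.
    return 1 if len(output.split()) <= 50 else 0

Eval(
    "Product QA",  # illustrative project name
    data=lambda: [
        {"input": "Does the Pro plan include SSO?", "expected": "Yes, via SAML."}
    ],
    task=lambda input: "Yes, the Pro plan includes SSO via SAML.",  # stand-in for your app logic
    scores=[Factuality, concise],  # one aspect per scorer
)
```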
Create custom table views
The Scorers page supports custom table views to save your preferred filters, column order, and display settings. To create or update a custom table view:
- Apply the filters and display settings you want.
- Open the menu and select Save view… or Save view as….
Custom table views are visible to all project members. Creating or editing a table view requires the Update project permission.
Set default table views
You can set default views at two levels:
- Organization default: Visible to all members when they open the page. This applies per page — for example, you can set separate organization defaults for Logs, Experiments, and Review. To set an organization default, you need the Manage settings organization permission (included by default in the Owner role). See Access control for details.
- Personal default: Overrides the organization default for you only. Personal defaults are stored in your browser, so they do not carry over across devices or browsers.
To set a default view:
- Switch to the view you want by selecting it from the menu.
- Open the menu again and hover over the currently selected view to reveal its submenu.
- Choose Set as personal default view or Set as organization default view.
To clear a default view:
- Open the menu and hover over the currently selected view to reveal its submenu.
- Choose Clear personal default view or Clear organization default view.
Next steps
- Autoevals: Drop-in pre-built scorers
- LLM-as-a-judge: Natural language evaluation criteria
- Custom code: Full control over scoring logic
- Run evaluations using your scorers
- Score production logs with online scoring rules