Scorers evaluate AI output quality by assigning scores between 0 and 1 based on criteria you define, like factual accuracy, helpfulness, or correct formatting.
Scorer types
Braintrust offers three types of scorers:
- Autoevals: Pre-built, battle-tested scorers for common evaluation tasks like factuality checking, semantic similarity, and format validation. Best for standard evaluation needs where reliable scorers already exist.
- LLM-as-a-judge: Use language models to evaluate outputs based on natural language criteria and instructions. Best for subjective judgments like tone, helpfulness, or creativity that are difficult to encode in deterministic code.
- Custom code: Write custom evaluation logic with full control over the scoring algorithm. Best for specific business rules, pattern matching, or calculations unique to your use case.
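For example, here is a minimal sketch of running a pre-built autoeval directly, assuming the autoevals Python package is installed and an LLM key is configured:

```python
from autoevals import Factuality

# Factuality is an LLM-as-a-judge autoeval that compares the output
# to the expected answer for factual consistency.
evaluator = Factuality()
result = evaluator(
    input="Which country has the highest population?",
    output="People's Republic of China",
    expected="China",
)
print(result.score)  # a value between 0 and 1
```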
Where to define scorers
You can define scorers in three places:
- Inline in SDK code: Define scorers directly in your evaluation scripts. Best for local development, access to complex dependencies, or application-specific logic tightly coupled to your codebase.
- Pushed via CLI: Define TypeScript or Python scorers in code files and push them to Braintrust. Best for version control in Git, team-wide sharing across projects, and automatic evaluation of production logs.
- Created in UI: Build TypeScript or Python scorers in the Braintrust web interface. Best for rapid prototyping and simple LLM-as-a-judge scorers.
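As a sketch of the inline approach, a custom scorer can be an ordinary function passed to Eval (the project name and scorer here are illustrative):

```python
from braintrust import Eval

def exact_match(input, expected, output):
    # Deterministic custom-code scorer: full credit only for an exact match.
    return 1 if output == expected else 0

Eval(
    "Say Hi Bot",  # illustrative project name
    data=lambda: [{"input": "Foo", "expected": "Hi Foo"}],
    task=lambda input: "Hi " + input,
    scores=[exact_match],
)
```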
Test scorers
Scorers need to be developed iteratively against real data. When creating or editing a scorer in the UI, use the Run section to test your scorer with data from different sources. Each variable source populates the scorer’s input parameters (like input, output, expected, and metadata) from a different location.
Test with manual input
Best for initial development when you have a specific example in mind. Use this to quickly prototype and verify basic scorer logic before testing on larger datasets.
- Select Editor in the Run section.
- Enter values for the input, output, expected, and metadata fields.
- Click Test to see how your scorer evaluates the example.
- Iterate on your scorer logic based on the results.
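If you are prototyping a code scorer in the editor, a minimal handler might look like the following. This is a sketch, assuming the UI passes input, output, expected, and metadata as named parameters to a handler function:

```python
def handler(input, output, expected, metadata):
    # Illustrative logic: full credit for an exact match,
    # partial credit if the expected answer appears in the output.
    if output == expected:
        return 1
    if expected and expected in output:
        return 0.5
    return 0
```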
Test with a dataset
Best for testing specific scenarios, edge cases, or regression testing. Use this when you want controlled, repeatable test cases or need to ensure your scorer handles specific situations correctly.
- Select Dataset in the Run section.
- Choose a dataset from your project.
- Select a record to test with.
- Click Test to see how your scorer evaluates the example.
- Review results to identify patterns and edge cases.
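To seed such a dataset from code, a sketch using the SDK might look like this (the project and dataset names are illustrative):

```python
from braintrust import init_dataset

dataset = init_dataset(project="My Project", name="Scorer test cases")

# Insert known edge cases you want the scorer to handle correctly.
dataset.insert(
    input="What is 2 + 2?",
    expected="4",
    metadata={"category": "arithmetic"},
)
dataset.flush()
```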
Test with logs
Best for testing against actual usage patterns and debugging real-world edge cases. Use this when you want to see how your scorer performs on data your system is actually generating.
- Select Logs in the Run section.
- Select the project containing the logs you want to test against.
- Filter logs to find relevant examples:
- Click Add filter and choose just root spans, specific span names, or a more advanced filter based on specific input, output, metadata, or other values.
- Select a timeframe.
- Click Test to see how your scorer evaluates real production data.
- Identify cases where the scorer needs adjustment for real-world scenarios.
Scorer permissions
Both LLM-as-a-judge and custom code scorers automatically receive a BRAINTRUST_API_KEY environment variable that allows them to:
- Make LLM calls using organization and project AI secrets
- Access attachments from the current project
- Read and write logs to the current project
- Read prompts from the organization
You can manage custom environment variables for scorers through the PUT /v1/env_var endpoint.
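As a sketch of what this enables, a custom code scorer can use the injected key to call a model through the Braintrust AI proxy, assuming the openai package is available in the scorer runtime:

```python
import os

from openai import OpenAI

# BRAINTRUST_API_KEY is injected automatically into the scorer runtime.
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["BRAINTRUST_API_KEY"],
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # routed through your configured AI secrets
    messages=[{"role": "user", "content": "Rate this answer from 0 to 1: ..."}],
)
```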
Optimize with Loop
Generate and improve scorers using Loop. Example queries:
- “Write an LLM-as-a-judge scorer for a chatbot that answers product questions”
- “Generate a code-based scorer based on project logs”
- “Optimize the Helpfulness scorer”
- “Adjust the scorer to be more lenient”
Best practices
- Start with autoevals: Use pre-built scorers when they fit your needs. They’re well-tested and reliable.
- Be specific: Define clear evaluation criteria in your scorer prompts or code.
- Use multiple scorers: Measure different aspects (factuality, helpfulness, tone) with separate scorers, as in the sketch below.
- Choose the right scope: Use trace scorers for multi-step workflows and agents. Use span scorers for simple quality checks.
- Test scorers: Run scorers on known examples to verify they behave as expected.
- Version scorers: Like prompts, scorers are versioned automatically. Track what works.
- Balance cost and quality: LLM-as-a-judge scorers are more flexible but cost more and take longer than custom code scorers.
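For instance, a single evaluation can combine a pre-built autoeval with a custom code scorer, each measuring a different aspect. A sketch, with illustrative names:

```python
from autoevals import Factuality
from braintrust import Eval

def concise(input, expected, output):
    # Custom-code scorer for formatting: reward answers under 50 words.
    return 1 if len(output.split()) <= 50 else 0

Eval(
    "Product QA",  # illustrative project name
    data=lambda: [
        {"input": "Does the Pro plan include SSO?", "expected": "Yes, via SAML."}
    ],
    task=lambda input: "Yes, the Pro plan includes SSO via SAML.",  # stand-in for your app logic
    scores=[Factuality, concise],  # one aspect per scorer
)
```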
Create custom table views
The Scorers page supports custom table views to save your preferred filters, column order, and display settings. To create or update a custom table view:
- Apply the filters and display settings you want.
- Open the menu and select Save view… or Save view as….
Custom table views are visible to all project members. Creating or editing a table view requires the Update project permission.
Set default table views
You can set default views at two levels:
- Organization default: Visible to all members when they open the page. This applies per page — for example, you can set separate organization defaults for Logs, Experiments, and Review. To set an organization default, you need the Manage settings organization permission (included by default in the Owner role). See Access control for details.
- Personal default: Overrides the organization default for you only. Personal defaults are stored in your browser, so they do not carry over across devices or browsers.
To set a default view:
- Switch to the view you want by selecting it from the menu.
- Open the menu again and hover over the currently selected view to reveal its submenu.
- Choose Set as personal default view or Set as organization default view.
To clear a default view:
- Open the menu and hover over the currently selected view to reveal its submenu.
- Choose Clear personal default view or Clear organization default view.
Next steps
- Autoevals: Drop-in pre-built scorers
- LLM-as-a-judge: Natural language evaluation criteria
- Custom code: Full control over scoring logic
- Run evaluations using your scorers
- Score production logs with online scoring rules