Interpret evaluation results

Each offline evaluation creates an experiment, a permanent record of how the evaluated task performed on a dataset.

View results

To view the results of an experiment, go to Experiments in your project and select the experiment from the list.

Traces vs. spans - By default, experiments display as a table of traces, where each row represents a complete trace with its root span. To view the individual spans in traces instead, select Display > Row type > Spans. View individual spans when you want to:
- Analyze specific operations within traces
- Find particular function calls or API requests
- Examine timing and token usage for individual operations

Spans view is optimized for analyzing individual operations. Experiment comparisons and diff mode are only available when viewing traces.

Metrics - Along with the scores you track, Braintrust tracks several metrics about your LLM calls that help you assess and understand performance. For example, if the average duration increased substantially after you changed models, looking at duration and token metrics together helps you diagnose the underlying cause. To compute LLM metrics such as token counts, make sure you wrap your LLM calls.
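As a rough illustration of reading duration and token metrics together, the sketch below compares two runs. The row shape and the field names `duration` and `total_tokens` are assumptions made for this example, not the format Braintrust exports.

```python
from statistics import mean


def summarize_run(rows):
    """Return average duration, average tokens, and seconds per token for a run.

    Each row is a hypothetical dict with `duration` (seconds) and `total_tokens`.
    """
    avg_duration = mean(row["duration"] for row in rows)
    avg_tokens = mean(row["total_tokens"] for row in rows)
    return {
        "avg_duration": avg_duration,
        "avg_tokens": avg_tokens,
        "seconds_per_token": avg_duration / avg_tokens,
    }


# Hypothetical rows from a baseline run and a run with a different model.
baseline = [
    {"duration": 2.0, "total_tokens": 200},
    {"duration": 4.0, "total_tokens": 400},
]
candidate = [
    {"duration": 4.0, "total_tokens": 400},
    {"duration": 8.0, "total_tokens": 800},
]

before = summarize_run(baseline)
after = summarize_run(candidate)
print(before)
print(after)
```

In this made-up data the average duration doubles, but so does the token count, and the time per token stays the same. That pattern suggests the new model is producing longer responses rather than running slower per token.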

Experiment summary - Select Details to view:
- Comparisons to other experiments
- Scorers used in the evaluation
- Datasets tested
- Saved parameters linked to the evaluation
- Metadata like model and parameters

Copy the experiment ID from the bottom of the summary pane to reference it in code or share it with teammates.

Filter results

Each project provides default table views with common filters for experiments, including:

Default view: Shows all traces in the experiment
Non-errors: Shows only traces without errors
Errors: Shows only traces with errors
Scorer errors: Shows only traces with scorer errors
Unreviewed: Hides traces that have been human-reviewed
Assigned to me: Shows only traces assigned to the current user for human review

Use the view menu to switch the table view.

Built-in views (such as “All experiments view”) cannot be modified, but you can create custom table views based on custom filters and display settings.

You can also use the Filter menu to add custom filtering. Use the Basic tab for point-and-click filtering, or switch to SQL to write precise SQL queries. To filter experiments by metadata programmatically, use the metadata query parameter on GET /v1/experiment. See Filter experiments by metadata for details.
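A minimal sketch of building such a request URL follows. The `metadata` parameter and the `GET /v1/experiment` path come from the text above. The `project_id` parameter, the JSON encoding of the metadata value, and the example values are assumptions for illustration, so check Filter experiments by metadata for the actual format before relying on this.

```python
import json
from urllib.parse import urlencode

API_URL = "https://api.braintrust.dev"


def build_experiment_list_url(project_id: str, metadata: dict) -> str:
    """Build a GET /v1/experiment URL that filters by metadata.

    Assumption: the metadata filter is sent as a JSON-encoded object.
    """
    params = {
        "project_id": project_id,
        "metadata": json.dumps(metadata, sort_keys=True),
    }
    return f"{API_URL}/v1/experiment?{urlencode(params)}"


# Hypothetical project ID and metadata values.
url = build_experiment_list_url("your-project-id", {"model": "example-model"})
print(url)
```

Send the resulting URL with the same `Authorization: Bearer` header you use for other API calls.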

Group results

Select Display > Group by to group the table by metadata fields to see patterns. By default, group rows show one experiment’s summary data. To view summary data for all experiments, select Include comparisons in group.
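If you want to reproduce a grouping outside the UI, for example on downloaded results, the idea can be sketched as follows. The row shape (a `metadata` dict and a `scores` dict per row) and the field names are assumptions for illustration, and this is not how Braintrust computes group summaries internally.

```python
from collections import defaultdict
from statistics import mean


def group_scores_by_metadata(rows, field, score):
    """Average one score per distinct value of a metadata field."""
    groups = defaultdict(list)
    for row in rows:
        key = row.get("metadata", {}).get(field)
        value = row.get("scores", {}).get(score)
        if value is not None:  # Skip rows that were not scored.
            groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical rows with a `category` metadata field and a `factuality` score.
rows = [
    {"metadata": {"category": "billing"}, "scores": {"factuality": 1.0}},
    {"metadata": {"category": "billing"}, "scores": {"factuality": 0.5}},
    {"metadata": {"category": "shipping"}, "scores": {"factuality": 0.25}},
]

grouped = group_scores_by_metadata(rows, "category", "factuality")
print(grouped)  # {'billing': 0.75, 'shipping': 0.25}
```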

Order by regressions

Score and metric columns show summary statistics in their headers. To order columns by regressions, select Display > Columns > Order by regressions. Within grouped tables, this sorts rows by regressions of a specific score relative to a comparison experiment.
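To make the idea concrete, the sketch below counts, for each score, how many test cases scored lower than in a comparison experiment, then orders the scores by that count. Matching cases by a shared ID and counting any decrease as a regression are assumptions for this example. Braintrust's own ordering logic may differ.

```python
def regression_counts(current, baseline):
    """Count, per score name, the cases that scored lower than in the baseline.

    Both arguments map a hypothetical case ID to a dict of score name -> value.
    """
    counts = {}
    for case_id, scores in current.items():
        baseline_scores = baseline.get(case_id, {})
        for name, value in scores.items():
            counts.setdefault(name, 0)
            if name in baseline_scores and value < baseline_scores[name]:
                counts[name] += 1
    return counts


def order_by_regressions(current, baseline):
    """Return score names, most regressions first (ties sorted by name)."""
    counts = regression_counts(current, baseline)
    return sorted(counts, key=lambda name: (-counts[name], name))


# Hypothetical scores for two cases in two experiments.
baseline = {
    "case-1": {"factuality": 1.0, "relevance": 1.0},
    "case-2": {"factuality": 0.5, "relevance": 0.75},
}
current = {
    "case-1": {"factuality": 0.5, "relevance": 1.0},
    "case-2": {"factuality": 0.25, "relevance": 0.5},
}

print(order_by_regressions(current, baseline))  # ['factuality', 'relevance']
```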

Examine a trace

Select any row to open the trace view and see complete details:

Input, output, and expected values
Metadata and parameters
All spans in the trace hierarchy
Scores and their explanations
Timing and token usage

Ask yourself: Do good scores correspond to good outputs? If not, update your scorers or test cases.

You can expand the trace to fullscreen or open it in a separate page. For details on trace views, layouts, and actions, see Examine traces.

When comparing experiments with diff mode enabled, only the default trace view is available. Timeline, Thread, and custom views are disabled during comparison.

Assign for review

You can assign experiment rows to team members for review, analysis, or follow-up action. Assignments are particularly useful for human review workflows, where you can assign specific rows that need human evaluation and distribute review work across multiple team members. See Assign rows for review for details.

Score retrospectively

Apply scorers to existing experiments:

Multiple cases: Select rows and use Score to apply chosen scorers
Single case: Open a trace and use Score in the trace view

Scores appear as additional spans within the trace.

Analyze with Loop

Use Loop to analyze experiment results, identify patterns, and get improvement suggestions. Loop can help you understand why certain test cases succeeded or failed and generate actionable recommendations. Select one or more experiments and open Loop to:

Summarize results: Get high-level insights about experiment performance, score trends, and key differences between experiments.
Drill into specific rows: Ask Loop to analyze test cases that performed poorly or identify patterns across failures.
Generate improvements: Loop can suggest changes to prompts, scorers, or datasets based on experiment results.
Create datasets: Extract problematic or interesting test cases into new datasets for targeted evaluation.
Generate code: Get sample code for implementing improvements to test in your next experiment.

Example queries:

“What improved from the last experiment?”
“Categorize the errors in this experiment”
“Pick the best scorers for this task”
“Why did the factuality score drop?”
“Create a dataset from the rows where the model failed”
“What patterns do you see in the low-scoring cases?”

Use aggregate scores

Aggregate scores are formulas that combine multiple scores into a single metric. They are useful when you track many scores but need a single metric to represent overall experiment quality. See Create aggregate scores for more details.
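As a conceptual illustration only, one common way to combine scores is a weighted average. The sketch below is plain Python, not Braintrust's formula syntax, and the score names and weights are hypothetical.

```python
def weighted_average(scores, weights):
    """Combine several scores into one value using per-score weights.

    Scores missing from `scores` are ignored, and the remaining weights
    are renormalized so the result stays in the same range as the inputs.
    """
    used = [name for name in weights if name in scores]
    total_weight = sum(weights[name] for name in used)
    if total_weight == 0:
        raise ValueError("no weighted scores available to aggregate")
    return sum(scores[name] * weights[name] for name in used) / total_weight


# Hypothetical scores for one experiment, with factuality weighted 3x.
scores = {"factuality": 1.0, "relevance": 0.5}
weights = {"factuality": 3, "relevance": 1}

print(weighted_average(scores, weights))  # 0.875
```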

Download results

To download an experiment’s results, open the menu next to the experiment name, then select Download as CSV or Download as JSON.

Customize the experiments table

Adjust table layout

To switch between different layouts, select Display > Layout and one of the following:

List: Default table view
Grid: Compare outputs side-by-side
Summary: Large-type summary of scores and metrics across all experiments
Summary table: Scores and metrics as rows with experiments as columns, with a PDF download option

Layouts respect view filters and are automatically saved when you save a view.

Show and hide columns

Select Display > Columns and then:

Show or hide columns to focus on relevant data
Reorder columns by dragging them
Pin important columns to the left

All column settings are automatically saved when you save a view.

When topics are enabled, facet outputs appear as columns in the experiments table, similar to scores. You can filter and sort by facet columns to analyze patterns in your evaluation results. This helps identify which types of inputs (e.g., specific user tasks or sentiment categories) perform better or worse in your experiments.

Create custom columns

Extract specific values from traces using custom columns:

Select Display > Columns > + Add custom column.
Name your column.
Choose from inferred fields or write a SQL expression.

Once created, filter and sort using your custom columns.

Create custom table views

To create or update a custom table view:

Apply the filters and display settings you want.
Open the menu and select Save view… or Save view as….

Custom table views are visible to all project members. Creating or editing a table view requires the Update project permission.

Set default table views

You can set default views at two levels:

Organization default: Visible to all members when they open the page. This applies per page — for example, you can set separate organization defaults for Logs, Experiments, and Review. To set an organization default, you need the Manage settings organization permission (included by default in the Owner role). See Access control for details.
Personal default: Overrides the organization default for you only. Personal defaults are stored in your browser, so they do not carry over across devices or browsers.

To set a default view:

Switch to the view you want by selecting it from the menu.
Open the menu again and hover over the currently selected view to reveal its submenu.
Choose Set as personal default view or Set as organization default view.

To clear a default view:

Open the menu and hover over the currently selected view to reveal its submenu.
Choose Clear personal default view or Clear organization default view.

When a user opens a page, Braintrust loads the first match in this order: personal default, organization default, then the standard “All …” view (e.g., “All logs view”).
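The lookup order described above is a first-match fallback, which can be sketched as follows. The function and argument names are hypothetical and only illustrate the precedence, not Braintrust's implementation.

```python
from typing import Optional


def resolve_default_view(
    personal: Optional[str],
    organization: Optional[str],
    standard: str,
) -> str:
    """Return the first configured view: personal, then organization, then standard."""
    for candidate in (personal, organization):
        if candidate:
            return candidate
    return standard


# Hypothetical view names.
print(resolve_default_view(None, "Team triage", "All logs view"))  # Team triage
```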

Change the table density

To change the table density to see more or less detail per row, select Display > Row height > Compact or Tall.

Export experiments

To export an experiment’s results, open the menu next to the experiment name. You can export as CSV or JSON, and choose whether to download all fields.

Access data from previous experiments by passing the open flag to init():

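import { init } from "braintrust";

// Opens a previous experiment and logs each of its records.
async function openExperiment() {
  // `open: true` loads the existing experiment instead of creating a new one.
  const experiment = init("My Project", {
    experiment: "my-experiment",
    open: true,
  });

  // Iterate over the records stored in the experiment.
  for await (const testCase of experiment) {
    console.log(testCase);
  }
}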

Convert experiments to dataset format using asDataset() in TypeScript or as_dataset() in Python:

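import { init } from "braintrust";

// Opens a previous experiment and logs its records in dataset format.
async function openExperimentAsDataset() {
  const experiment = init("My Project", {
    experiment: "my-experiment",
    open: true,
  });

  // asDataset() yields the experiment's records in dataset format.
  for await (const testCase of experiment.asDataset()) {
    console.log(testCase);
  }
}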

Fetch experiment events via the API using Fetch experiment (POST form) or Fetch experiment (GET form).

You can also query experiments with SQL for custom analysis. For example, to check review status:

import os
import requests

# No trailing slash: the request path below already starts with "/".
API_URL = "https://api.braintrust.dev"
headers = {"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]}

def fetch_experiment_review_status(experiment_id: str) -> dict:
    # Replace "response quality" with your review score column name
    query = f"""
    SELECT
      sum(CASE WHEN scores."response quality" IS NOT NULL THEN 1 ELSE 0 END) AS reviewed,
      sum(CASE WHEN is_root THEN 1 ELSE 0 END) AS total
    FROM experiment('{experiment_id}')
    """

    return requests.post(
        f"{API_URL}/btql",
        headers=headers,
        json={"query": query, "fmt": "json"},
    ).json()

# Usage
result = fetch_experiment_review_status("your-experiment-id")
print(f"Reviewed: {result['data'][0]['reviewed']}/{result['data'][0]['total']}")

Download experiment data to a local NDJSON file with bt sync pull:

bt sync pull experiment:my-experiment
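Once you have an NDJSON file locally, you can process it with a few lines of Python. The sketch below averages each score across all records. It assumes each line is a JSON object with a `scores` field mapping score names to numbers, and the file path is whatever location the pull wrote to. Both are assumptions, so adjust them to match your file.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def average_scores(path):
    """Average every score found in an NDJSON file of experiment records."""
    values_by_score = defaultdict(list)
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # Tolerate blank lines.
                continue
            record = json.loads(line)
            # Assumed shape: {"scores": {"<score name>": <number or null>}, ...}
            for name, value in (record.get("scores") or {}).items():
                if value is not None:
                    values_by_score[name].append(value)
    return {name: mean(values) for name, values in values_by_score.items()}
```

Call `average_scores()` with the path to the downloaded file. Records without scores, and scores set to null, are skipped rather than counted as zero.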

Query experiment data with SQL using bt sql:

bt sql "SELECT id, input, output, scores FROM experiment('my-experiment')"

Next steps

Compare experiments systematically
Write scorers to measure what matters
Use playgrounds for rapid iteration
Run evaluations in CI/CD
