An experiment is an immutable snapshot of an evaluation run — permanently stored, comparable over time, and shareable across your team. Unlike playground runs, which overwrite previous results for fast iteration, experiments preserve exact results so you can measure improvements, catch regressions, and build confidence in your changes.
Run your evaluation code locally to create an experiment in Braintrust; the run returns summary metrics, including a direct link to your experiment. See Interpret results for how to read it.
Use --watch to re-run automatically when files change:
bt eval --watch my_eval.eval.ts
Benefits of using the CLI:
Automatic .env loading — reads .env.development.local, .env.local, .env.development, and .env
Multi-file support — pass multiple files or directories: bt eval [file or directory] .... Running bt eval with no arguments runs all eval files in the current directory.
TypeScript transpilation — no build step required; the CLI handles it
Install the SDK and dependencies:
pip install braintrust openai autoevals
Create the eval code:
from braintrust import Eval, init_dataset
from autoevals import Factuality

Eval(
    "My project",
    experiment_name="My experiment",
    data=init_dataset(project="My project", name="My dataset"),
    task=lambda input: call_model(input),  # Your LLM call here
    scores=[Factuality],
    metadata={
        "model": "gpt-5-mini",
    },
)
Use --watch to re-run automatically when files change:
bt eval --watch my_eval.py
Benefits of using the CLI:
Automatic .env loading — reads .env.development.local, .env.local, .env.development, and .env
Multi-file support — pass multiple files or directories: bt eval [file or directory] .... Running bt eval with no arguments runs all eval files in the current directory.
TypeScript transpilation — no build step required; the CLI handles it
Install the SDK and dependencies:
go get github.com/braintrustdata/braintrust-sdk-go
go get github.com/openai/openai-go
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Braintrust.Sdk;
using Braintrust.Sdk.Eval;

class Program
{
    static string CallModel(string input)
    {
        // Your LLM call implementation here
        return "model output";
    }

    static async Task Main(string[] args)
    {
        var braintrust = Braintrust.Sdk.Braintrust.Get();
        var eval = await braintrust
            .EvalBuilder<string, string>()
            .Name("My Project")
            .Cases(
                new DatasetCase<string, string>("example input", "example expected")
            )
            .TaskFunction(input => CallModel(input)) // Your LLM call here
            .Scorers(
                new FunctionScorer<string, string>(
                    "exact_match",
                    (expected, actual) => actual == expected ? 1.0 : 0.0)
            )
            .BuildAsync();

        var result = await eval.RunAsync();
        Console.WriteLine(result.CreateReportString());
    }
}
Run your evaluation:
dotnet run
You can pass a parameters option to make configuration values (like model choice, temperature, or prompts) editable in the playground without changing code. Define parameters inline or use loadParameters() to reference saved configurations. See Write parameters and Test complex agents for details.
Playground runs are mutable — re-running overwrites previous results. When you’ve iterated to a configuration worth keeping, promote it to an experiment to capture an immutable snapshot:
Run your playground.
Select + Experiment.
Name your experiment.
Access it from the Experiments page.
Each playground task maps to its own experiment. Experiments created this way are comparable to any other experiment in your project.
Create an API key and set it as BRAINTRUST_API_KEY in your CI environment. Use --no-input to suppress prompts and --json for machine-readable output. Use --first N or --sample N to run a non-final smoke test on pull requests and reserve the full run for merges:
bt eval tests/ --first 20 --no-input --json  # smoke run on PR, non-final
bt eval tests/ --no-input --json             # full run on merge, final
Sometimes you want to run your evaluation locally without creating an experiment in Braintrust — while iterating on a new scorer, wiring up a new eval pipeline, or running in an environment without a Braintrust API key. Your tasks and scorers still run and print a summary to your terminal; results just aren’t uploaded.
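As the paragraph above notes, an environment with no Braintrust API key still runs tasks and scorers locally. A minimal sketch of forcing that behavior for a single run (the eval filename is illustrative; your CLI version may also offer a dedicated flag for suppressing uploads):

```shell
# Unset the API key for this invocation only: tasks and scorers run
# and a summary prints to the terminal, but nothing is uploaded.
env -u BRAINTRUST_API_KEY bt eval my_eval.eval.ts
```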
Run each input multiple times to measure variance and get more robust scores. Braintrust intelligently aggregates results by bucketing test cases with the same input value:
Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],
  trialCount: 10, // Run each input 10 times
});
To analyze trial results and compare variance across inputs, see Compare trials.
Hill climbing lets you improve iteratively without predefined expected outputs by using a previous experiment's output as the expected value for the current run. To enable it, use BaseExperiment() in the data field. Autoevals scorers like Battle and Summary are designed specifically for this workflow.
import { Battle } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";

Eval<string, string, string>(
  "Say Hi Bot", // Replace with your project name
  {
    data: BaseExperiment(),
    task: (input) => {
      return "Hi " + input; // Replace with your task function
    },
    scores: [Battle.partial({ instructions: "Which response said 'Hi'?" })],
  },
);
Braintrust automatically picks the best base experiment using git metadata if available, or timestamps otherwise, then populates the expected field by merging the expected and output fields from the base experiment. If you set expected through the UI while reviewing results, it will be used as the expected field for the next experiment. To use a specific experiment as the base, pass the name field to BaseExperiment():
import { Battle } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";

Eval<string, string, string>(
  "Say Hi Bot", // Replace with your project name
  {
    data: BaseExperiment({ name: "main-123" }),
    task: (input) => {
      return "Hi " + input; // Replace with your task function
    },
    scores: [Battle.partial({ instructions: "Which response said 'Hi'?" })],
  },
);
When hill climbing, use two types of scoring functions:
Non-comparative methods like ClosedQA that judge output quality based purely on input and output without requiring an expected value. Track these across experiments to compare any two experiments, even if they aren’t sequentially related.
Comparative methods like Battle or Summary that accept an expected output but don’t treat it as ground truth. If you score > 50% on a comparative method, you’re doing better than the base on average. Learn more about how Battle and Summary work.
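As an illustrative sketch, the two kinds of scorers can run side by side in one hill-climbing eval. This reuses the toy "Say Hi Bot" example from above; the ClosedQA criteria text is made up for illustration:

```typescript
import { Battle, ClosedQA } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";

Eval<string, string, string>("Say Hi Bot", {
  data: BaseExperiment(),
  task: (input) => "Hi " + input,
  scores: [
    // Non-comparative: judges output quality from input and output alone,
    // so its scores stay comparable across any two experiments.
    ClosedQA.partial({ criteria: "Does the response greet the user?" }),
    // Comparative: treats expected (the base experiment's output) as a
    // baseline, not ground truth; >50% means beating the base on average.
    Battle.partial({ instructions: "Which response said 'Hi'?" }),
  ],
});
```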
When you run an experiment, Braintrust logs results to your terminal, and bt eval returns a non-zero exit code if any eval throws an exception. Customize this behavior for CI/CD pipelines to precisely define what constitutes a failure or to report results to different systems. Define custom reporters using Reporter(). A reporter has two functions:
import { Reporter } from "braintrust";

Reporter(
  "My reporter", // Replace with your reporter name
  {
    reportEval(evaluator, result, opts) {
      // Summarize the results of a single evaluator and return whatever
      // you want (the full results, a piece of text, or both!)
    },
    reportRun(results) {
      // Take all the results and summarize them. Return true or false,
      // which tells the process whether to exit successfully.
      return true;
    },
  },
);
Any Reporter included among your evaluated files will be automatically picked up by the bt eval CLI command.
If no reporters are defined, the default reporter logs results to the console.
If you define one reporter, it’s used for all Eval blocks.
If you define multiple Reporters, specify the reporter name as an optional third argument to the Eval function.
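A sketch of selecting a reporter by name (assuming a Reporter named "My reporter" is defined as above, and that the third argument accepts the reporter's name; the data and task here are placeholders):

```typescript
import { Factuality } from "autoevals";
import { Eval } from "braintrust";

Eval(
  "My Project",
  {
    data: () => [{ input: "hi", expected: "hi" }], // placeholder data
    task: (input: string) => input, // placeholder task
    scores: [Factuality],
  },
  // With multiple Reporters defined, name the one this Eval should use.
  "My reporter",
);
```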
Braintrust allows you to log binary data like images, audio, and PDFs as attachments. Use attachments in evaluations by initializing an Attachment object in your data:
import { Eval, Attachment } from "braintrust";
import { NumericDiff } from "autoevals";
import path from "path";

function loadPdfs() {
  return ["example.pdf"].map((pdf) => ({
    input: {
      file: new Attachment({
        filename: pdf,
        contentType: "application/pdf",
        data: path.join("files", pdf),
      }),
    },
    // This is a toy example where we check that the file size is what we expect.
    expected: 469513,
  }));
}

async function getFileSize(input: { file: Attachment }) {
  return (await input.file.data()).size;
}

Eval("Project with PDFs", {
  data: loadPdfs,
  task: getFileSize,
  scores: [NumericDiff],
});
You can also store attachments in a dataset for reuse across multiple experiments. After creating the dataset, reference it by name in an eval. The attachment data is automatically downloaded from Braintrust when accessed:
import { NumericDiff } from "autoevals";
import { initDataset, Eval, ReadonlyAttachment } from "braintrust";

async function getFileSize(input: {
  file: ReadonlyAttachment;
}): Promise<number> {
  return (await input.file.data()).size;
}

Eval("Project with PDFs", {
  data: initDataset({
    project: "Project with PDFs",
    dataset: "My PDF Dataset",
  }),
  task: getFileSize,
  scores: [NumericDiff],
});
To forward an attachment to an external service like OpenAI, obtain a signed URL instead of downloading the data directly:
import { initDataset, wrapOpenAI, ReadonlyAttachment } from "braintrust";
import { OpenAI } from "openai";

const client = wrapOpenAI(
  new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
  }),
);

async function main() {
  const dataset = initDataset({
    project: "Project with images",
    dataset: "My Image Dataset",
  });
  for await (const row of dataset) {
    const attachment: ReadonlyAttachment = row.input.file;
    const attachmentUrl = (await attachment.metadata()).downloadUrl;
    const response = await client.chat.completions.create({
      model: "gpt-5-mini",
      messages: [
        {
          role: "system",
          content: "You are a helpful assistant",
        },
        {
          role: "user",
          content: [
            { type: "text", text: "Please summarize the attached image" },
            { type: "image_url", image_url: { url: attachmentUrl } },
          ],
        },
      ],
    });
    const summary = response.choices[0].message.content || "Unknown";
    console.log(
      `Summary for file ${attachment.reference.filename}: ${summary}`,
    );
  }
}

main();
Add detailed tracing to your evaluation task functions to measure performance and debug issues. Each span in the trace represents an operation like an LLM call, database lookup, or API request.
Use wrapOpenAI/wrap_openai to automatically trace OpenAI API calls. See Trace LLM calls for details.
Each call to experiment.log() creates its own trace. Do not mix experiment.log() with tracing functions like traced(); doing so creates incorrectly parented traces.
Wrap task code with traced() to log incrementally to spans. This example progressively logs input, output, and metrics:
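A minimal sketch of this pattern (callModel is a placeholder for your LLM call, and the output_chars metric is an arbitrary illustration):

```typescript
import { traced } from "braintrust";

// Placeholder for a real LLM call
const callModel = async (input: string) => "Hi " + input;

async function myTask(input: string): Promise<string> {
  return traced(
    async (span) => {
      // Log the input as soon as the span starts
      span.log({ input });
      const output = await callModel(input);
      // Log the output and any metrics as they become available
      span.log({ output, metrics: { output_chars: output.length } });
      return output;
    },
    { name: "my-task" },
  );
}
```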
If your evaluations are slower than expected when using maxConcurrency, you may be on an older SDK version that flushes logs after every single task completion. Upgrade to TypeScript SDK v3.3.0+ for up to an 8x performance improvement. The SDK now uses byte-based backpressure for better flushing performance. You can tune the flush threshold with the BRAINTRUST_FLUSH_BACKPRESSURE_BYTES environment variable. See Tune performance for all available configuration options.
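For instance, you might raise the threshold before a large run (the 1 MB value and eval filename are arbitrary illustrations, not recommendations):

```shell
# Flush logs only after ~1 MB of pending data accumulates
export BRAINTRUST_FLUSH_BACKPRESSURE_BYTES=1048576
bt eval my_eval.eval.ts
```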
Task function throws an exception during eval (C# SDK v0.2.2+)
When the task function throws, the C# eval framework catches the exception, records it on the task span and root span (with ActivityStatusCode.Error), and calls ScoreForTaskException on every scorer instead of Score. The eval continues — no cases are skipped. By default, ScoreForTaskException returns a single score of 0.0. Override it on your IScorer to return a custom fallback score, return an empty list to omit scoring for that case, or re-throw to abort the eval.
using Braintrust.Sdk.Eval;

sealed class MyScorer : IScorer<string, string>
{
    public string Name => "my_scorer";

    public Task<IReadOnlyList<Score>> Score(TaskResult<string, string> taskResult)
    {
        var matches = taskResult.Result == taskResult.DatasetCase.Expected;
        return Task.FromResult<IReadOnlyList<Score>>([new Score(Name, matches ? 1.0 : 0.0)]);
    }

    // Called instead of Score() when the task function threw.
    // Return [] to skip recording a score; throw to abort the eval.
    public Task<IReadOnlyList<Score>> ScoreForTaskException(
        Exception taskException,
        DatasetCase<string, string> datasetCase)
    {
        // Distinguish between expected and unexpected failures
        if (taskException is TimeoutException)
            return Task.FromResult<IReadOnlyList<Score>>([new Score(Name, 0.0)]);
        return Task.FromResult<IReadOnlyList<Score>>([]); // skip scoring
    }
}
The task span and root eval span both receive an OTel exception event with exception.type, exception.message, and exception.stacktrace attributes, visible in any OTel-compatible backend connected to Braintrust.
Scorer throws an exception during eval (C# SDK v0.2.2+)
When a scorer’s Score method throws, the exception is recorded on that scorer’s span (with ActivityStatusCode.Error and an OTel exception event) and ScoreForScorerException is called as a fallback. Other scorers continue running unaffected. By default, ScoreForScorerException returns a single score of 0.0. Override it to return a custom fallback, return an empty list to omit the score, or re-throw to abort the eval.
using Braintrust.Sdk.Eval;

sealed class MyScorer : IScorer<string, string>
{
    public string Name => "my_scorer";

    public Task<IReadOnlyList<Score>> Score(TaskResult<string, string> taskResult)
    {
        // ... scoring logic that might throw
        throw new InvalidOperationException("unexpected output format");
    }

    // Called when Score() throws. Other scorers are not affected.
    // Return [] to skip recording a score; throw to abort the eval.
    public Task<IReadOnlyList<Score>> ScoreForScorerException(
        Exception scorerException,
        TaskResult<string, string> taskResult)
    {
        return Task.FromResult<IReadOnlyList<Score>>([new Score(Name, 0.0)]);
    }
}
Score spans are named score:<scorer_name> (e.g. score:my_scorer), making individual scorer traces distinguishable in Braintrust and any connected OTel backend.