An experiment is an immutable snapshot of an evaluation run — permanently stored, comparable over time, and shareable across your team. Unlike playground runs, which overwrite previous results for fast iteration, experiments preserve exact results so you can measure improvements, catch regressions, and build confidence in your changes.
Run your evaluation code locally to create an experiment in Braintrust; the run returns summary metrics, including a direct link to your experiment. See Interpret results for how to read it.
Use --watch to re-run automatically when files change:
bt eval --watch my_eval.eval.ts
Benefits of using the CLI:
Automatic .env loading — reads .env.development.local, .env.local, .env.development, and .env
Multi-file support — pass multiple files or directories: bt eval [file or directory] .... Running bt eval with no arguments runs all eval files in the current directory.
TypeScript transpilation — no build step required; the CLI handles it
Install the SDK and dependencies:
pip install braintrust openai autoevals
Create the eval code:
from braintrust import Eval, init_dataset
from autoevals import Factuality

Eval(
    "My project",
    experiment_name="My experiment",
    data=init_dataset(project="My project", name="My dataset"),
    task=lambda input: call_model(input),  # Your LLM call here
    scores=[Factuality],
    metadata={
        "model": "gpt-5-mini",
    },
)
Use --watch to re-run automatically when files change:
bt eval --watch my_eval.py
Benefits of using the CLI:
Automatic .env loading — reads .env.development.local, .env.local, .env.development, and .env
Multi-file support — pass multiple files or directories: bt eval [file or directory] .... Running bt eval with no arguments runs all eval files in the current directory.
TypeScript transpilation — no build step required; the CLI handles it
Install the SDK and dependencies:
go get github.com/braintrustdata/braintrust-sdk-go
go get github.com/openai/openai-go
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Braintrust.Sdk;
using Braintrust.Sdk.Eval;

class Program
{
    static string CallModel(string input)
    {
        // Your LLM call implementation here
        return "model output";
    }

    static async Task Main(string[] args)
    {
        var braintrust = Braintrust.Sdk.Braintrust.Get();
        var eval = await braintrust
            .EvalBuilder<string, string>()
            .Name("My Project")
            .Cases(
                new DatasetCase<string, string>("example input", "example expected")
            )
            .TaskFunction(input => CallModel(input)) // Your LLM call here
            .Scorers(
                new FunctionScorer<string, string>(
                    "exact_match",
                    (expected, actual) => actual == expected ? 1.0 : 0.0)
            )
            .BuildAsync();

        var result = await eval.RunAsync();
        Console.WriteLine(result.CreateReportString());
    }
}
Run your evaluation:
dotnet run
You can pass a parameters option to make configuration values (like model choice, temperature, or prompts) editable in the playground without changing code. Define parameters inline or use loadParameters() to reference saved configurations. See Write parameters and Test complex agents for details.
Playground runs are mutable — re-running overwrites previous results. When you’ve iterated to a configuration worth keeping, promote it to an experiment to capture an immutable snapshot:
Run your playground.
Select + Experiment.
Name your experiment.
Access it from the Experiments page.
Each playground task maps to its own experiment. Experiments created this way are comparable to any other experiment in your project.
Create an API key and set it as BRAINTRUST_API_KEY in your CI environment. Use --no-input to suppress prompts and --json for machine-readable output. Use --first N or --sample N to run a non-final smoke test on pull requests and reserve the full run for merges:
bt eval tests/ --first 20 --no-input --json  # smoke run on PR, non-final
bt eval tests/ --no-input --json             # full run on merge, final
Sometimes you want to run your evaluation locally without creating an experiment in Braintrust — while iterating on a new scorer, wiring up a new eval pipeline, or running in an environment without a Braintrust API key. Your tasks and scorers still run and print a summary to your terminal; results just aren’t uploaded.
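As the paragraph above notes, an environment with no Braintrust API key still runs tasks and scorers locally. A minimal sketch of forcing that behavior for a single run (the eval filename is illustrative; your CLI version may also offer a dedicated flag for suppressing uploads):

```shell
# Unset the API key for this invocation only: tasks and scorers run
# and a summary prints to the terminal, but nothing is uploaded.
env -u BRAINTRUST_API_KEY bt eval my_eval.eval.ts
```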
Run each input multiple times to measure variance and get more robust scores. Braintrust intelligently aggregates results by bucketing test cases with the same input value:
Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],
  trialCount: 10, // Run each input 10 times
});
To analyze trial results and compare variance across inputs, see Compare trials.
Hill climbing lets you improve iteratively without predefined expected outputs by using a previous experiment's output as the expected value for the current run. To enable it, use BaseExperiment() in the data field. Autoevals scorers like Battle and Summary are designed specifically for this workflow.
import { Battle } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";

Eval<string, string, string>(
  "Say Hi Bot", // Replace with your project name
  {
    data: BaseExperiment(),
    task: (input) => {
      return "Hi " + input; // Replace with your task function
    },
    scores: [Battle.partial({ instructions: "Which response said 'Hi'?" })],
  },
);
Braintrust automatically picks the best base experiment using git metadata if available, or timestamps otherwise, then populates the expected field by merging the expected and output fields from the base experiment. If you set expected through the UI while reviewing results, it will be used as the expected field for the next experiment. To use a specific experiment as the base, pass the name field to BaseExperiment():
import { Battle } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";

Eval<string, string, string>(
  "Say Hi Bot", // Replace with your project name
  {
    data: BaseExperiment({ name: "main-123" }),
    task: (input) => {
      return "Hi " + input; // Replace with your task function
    },
    scores: [Battle.partial({ instructions: "Which response said 'Hi'?" })],
  },
);
When hill climbing, use two types of scoring functions:
Non-comparative methods like ClosedQA that judge output quality based purely on input and output without requiring an expected value. Track these across experiments to compare any two experiments, even if they aren’t sequentially related.
Comparative methods like Battle or Summary that accept an expected output but don’t treat it as ground truth. If you score > 50% on a comparative method, you’re doing better than the base on average. Learn more about how Battle and Summary work.
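As an illustrative sketch, the two kinds of scorers can run side by side in one hill-climbing eval. This reuses the toy "Say Hi Bot" example from above; the ClosedQA criteria text is made up for illustration:

```typescript
import { Battle, ClosedQA } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";

Eval<string, string, string>("Say Hi Bot", {
  data: BaseExperiment(),
  task: (input) => "Hi " + input,
  scores: [
    // Non-comparative: judges output quality from input and output alone,
    // so its scores stay comparable across any two experiments.
    ClosedQA.partial({ criteria: "Does the response greet the user?" }),
    // Comparative: treats expected (the base experiment's output) as a
    // baseline, not ground truth; >50% means beating the base on average.
    Battle.partial({ instructions: "Which response said 'Hi'?" }),
  ],
});
```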
When you run an experiment, Braintrust logs results to your terminal, and bt eval returns a non-zero exit code if any eval throws an exception. Customize this behavior for CI/CD pipelines to precisely define what constitutes a failure or to report results to different systems. Define custom reporters using Reporter(). A reporter has two functions:
import { Reporter } from "braintrust";

Reporter(
  "My reporter", // Replace with your reporter name
  {
    reportEval(evaluator, result, opts) {
      // Summarize the results of a single evaluator and return whatever
      // you want (the full results, a piece of text, or both!)
    },
    reportRun(results) {
      // Take all the results and summarize them. Return true or false,
      // which tells the process whether to exit successfully.
      return true;
    },
  },
);
Any Reporter included among your evaluated files will be automatically picked up by the bt eval CLI command.
If no reporters are defined, the default reporter logs results to the console.
If you define one reporter, it’s used for all Eval blocks.
If you define multiple Reporters, specify the reporter name as an optional third argument to the Eval function.
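A sketch of selecting a reporter by name (assuming a Reporter named "My reporter" is defined as above, and that the third argument accepts the reporter's name; the data and task here are placeholders):

```typescript
import { Factuality } from "autoevals";
import { Eval } from "braintrust";

Eval(
  "My Project",
  {
    data: () => [{ input: "hi", expected: "hi" }], // placeholder data
    task: (input: string) => input, // placeholder task
    scores: [Factuality],
  },
  // With multiple Reporters defined, name the one this Eval should use.
  "My reporter",
);
```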
Braintrust allows you to log binary data like images, audio, and PDFs as attachments. Use attachments in evaluations by initializing an Attachment object in your data:
import { Eval, Attachment } from "braintrust";
import { NumericDiff } from "autoevals";
import path from "path";

function loadPdfs() {
  return ["example.pdf"].map((pdf) => ({
    input: {
      file: new Attachment({
        filename: pdf,
        contentType: "application/pdf",
        data: path.join("files", pdf),
      }),
    },
    // This is a toy example where we check that the file size is what we expect.
    expected: 469513,
  }));
}

async function getFileSize(input: { file: Attachment }) {
  return (await input.file.data()).size;
}

Eval("Project with PDFs", {
  data: loadPdfs,
  task: getFileSize,
  scores: [NumericDiff],
});
You can also store attachments in a dataset for reuse across multiple experiments. After creating the dataset, reference it by name in an eval. The attachment data is automatically downloaded from Braintrust when accessed:
import { NumericDiff } from "autoevals";
import { initDataset, Eval, ReadonlyAttachment } from "braintrust";

async function getFileSize(input: {
  file: ReadonlyAttachment;
}): Promise<number> {
  return (await input.file.data()).size;
}

Eval("Project with PDFs", {
  data: initDataset({
    project: "Project with PDFs",
    dataset: "My PDF Dataset",
  }),
  task: getFileSize,
  scores: [NumericDiff],
});
To forward an attachment to an external service like OpenAI, obtain a signed URL instead of downloading the data directly:
import { initDataset, wrapOpenAI, ReadonlyAttachment } from "braintrust";
import { OpenAI } from "openai";

const client = wrapOpenAI(
  new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
  }),
);

async function main() {
  const dataset = initDataset({
    project: "Project with images",
    dataset: "My Image Dataset",
  });
  for await (const row of dataset) {
    const attachment: ReadonlyAttachment = row.input.file;
    const attachmentUrl = (await attachment.metadata()).downloadUrl;
    const response = await client.chat.completions.create({
      model: "gpt-5-mini",
      messages: [
        {
          role: "system",
          content: "You are a helpful assistant",
        },
        {
          role: "user",
          content: [
            { type: "text", text: "Please summarize the attached image" },
            { type: "image_url", image_url: { url: attachmentUrl } },
          ],
        },
      ],
    });
    const summary = response.choices[0].message.content || "Unknown";
    console.log(
      `Summary for file ${attachment.reference.filename}: ${summary}`,
    );
  }
}

main();
Add detailed tracing to your evaluation task functions to measure performance and debug issues. Each span in the trace represents an operation like an LLM call, database lookup, or API request.
Use wrapOpenAI/wrap_openai to automatically trace OpenAI API calls. See Trace LLM calls for details.
Each call to experiment.log() creates its own trace. Do not mix experiment.log() with tracing functions like traced(); doing so creates incorrectly parented traces.
Wrap task code with traced() to log incrementally to spans. This example progressively logs input, output, and metrics:
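A minimal sketch of this pattern (callModel is a placeholder for your LLM call, and the output_chars metric is an arbitrary illustration):

```typescript
import { traced } from "braintrust";

// Placeholder for a real LLM call
const callModel = async (input: string) => "Hi " + input;

async function myTask(input: string): Promise<string> {
  return traced(
    async (span) => {
      // Log the input as soon as the span starts
      span.log({ input });
      const output = await callModel(input);
      // Log the output and any metrics as they become available
      span.log({ output, metrics: { output_chars: output.length } });
      return output;
    },
    { name: "my-task" },
  );
}
```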
If your evaluations are slower than expected when using maxConcurrency, you may be on an older SDK version that flushes logs after every single task completion. Upgrade to TypeScript SDK v3.3.0+ for up to an 8x performance improvement. The SDK now uses byte-based backpressure for better flushing performance. You can tune the flush threshold with the BRAINTRUST_FLUSH_BACKPRESSURE_BYTES environment variable. See Tune performance for all available configuration options.
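For instance, you might raise the threshold before a large run (the 1 MB value and eval filename are arbitrary illustrations, not recommendations):

```shell
# Flush logs only after ~1 MB of pending data accumulates
export BRAINTRUST_FLUSH_BACKPRESSURE_BYTES=1048576
bt eval my_eval.eval.ts
```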
Task function throws an exception during eval (C# SDK v0.2.2+)
When the task function throws, the C# eval framework catches the exception, records it on the task span and root span (with ActivityStatusCode.Error), and calls ScoreForTaskException on every scorer instead of Score. The eval continues — no cases are skipped. By default, ScoreForTaskException returns a single score of 0.0. Override it on your IScorer to return a custom fallback score, return an empty list to omit scoring for that case, or re-throw to abort the eval.
using Braintrust.Sdk.Eval;

sealed class MyScorer : IScorer<string, string>
{
    public string Name => "my_scorer";

    public Task<IReadOnlyList<Score>> Score(TaskResult<string, string> taskResult)
    {
        var matches = taskResult.Result == taskResult.DatasetCase.Expected;
        return Task.FromResult<IReadOnlyList<Score>>([new Score(Name, matches ? 1.0 : 0.0)]);
    }

    // Called instead of Score() when the task function threw.
    // Return [] to skip recording a score; throw to abort the eval.
    public Task<IReadOnlyList<Score>> ScoreForTaskException(
        Exception taskException,
        DatasetCase<string, string> datasetCase)
    {
        // Distinguish between expected and unexpected failures
        if (taskException is TimeoutException)
            return Task.FromResult<IReadOnlyList<Score>>([new Score(Name, 0.0)]);
        return Task.FromResult<IReadOnlyList<Score>>([]); // skip scoring
    }
}
The task span and root eval span both receive an OTel exception event with exception.type, exception.message, and exception.stacktrace attributes, visible in any OTel-compatible backend connected to Braintrust.
Scorer throws an exception during eval (C# SDK v0.2.2+)
When a scorer’s Score method throws, the exception is recorded on that scorer’s span (with ActivityStatusCode.Error and an OTel exception event) and ScoreForScorerException is called as a fallback. Other scorers continue running unaffected. By default, ScoreForScorerException returns a single score of 0.0. Override it to return a custom fallback, return an empty list to omit the score, or re-throw to abort the eval.
using Braintrust.Sdk.Eval;

sealed class MyScorer : IScorer<string, string>
{
    public string Name => "my_scorer";

    public Task<IReadOnlyList<Score>> Score(TaskResult<string, string> taskResult)
    {
        // ... scoring logic that might throw
        throw new InvalidOperationException("unexpected output format");
    }

    // Called when Score() throws. Other scorers are not affected.
    // Return [] to skip recording a score; throw to abort the eval.
    public Task<IReadOnlyList<Score>> ScoreForScorerException(
        Exception scorerException,
        TaskResult<string, string> taskResult)
    {
        return Task.FromResult<IReadOnlyList<Score>>([new Score(Name, 0.0)]);
    }
}
Score spans are named score:<scorer_name> (e.g. score:my_scorer), making individual scorer traces distinguishable in Braintrust and any connected OTel backend.