

bt eval

Run evaluation files against Braintrust. Supports JavaScript and Python.
bt eval is currently available on macOS and Linux only.

File selection

  • bt eval — discover and run all eval files in the current directory (recursive)
  • bt eval tests/ — discover eval files under a specific directory
  • bt eval "tests/**/*.eval.ts" — glob pattern
  • bt eval a.eval.ts b.eval.ts — one or more explicit files
Files inside node_modules, .venv, venv, site-packages, dist-packages, and __pycache__ are excluded from automatic discovery. Explicit paths and globs bypass these exclusions.

JavaScript runners

Requires Node.js 18.19.0+ or 20.6.0+. Bun 1.0+ and Deno with Node compatibility mode are also supported. By default, bt eval auto-detects a runner from your project (tsx, vite-node, ts-node, then ts-node-esm). Set one explicitly with --runner / BT_EVAL_RUNNER:
bt eval --runner vite-node tutorial.eval.ts
bt eval --runner tsx tutorial.eval.ts
bt eval automatically resolves locally installed binaries from node_modules/.bin, so you can write --runner tsx instead of --runner ./node_modules/.bin/tsx (for example). If you see ESM or top-level await errors, try --runner vite-node.

Python

bt eval also runs Python eval files. Use --language py to force Python rather than relying on auto-detection. By default, if VIRTUAL_ENV is set, bt uses that virtualenv’s Python; otherwise it searches PATH for python3 or python. To use a specific interpreter, set BT_EVAL_PYTHON_RUNNER to its name or path (e.g. python3.11). The --num-workers flag controls concurrency for Python execution.
bt eval my_eval.py
bt eval --language py --num-workers 4 my_eval.py

Sampling modes

Run a subset of your evaluation data as a non-final smoke run to catch obvious regressions before committing to the full dataset.
bt eval --first 20 qa.eval.ts          # First 20 examples, non-final
bt eval --sample 20 qa.eval.ts         # Random 20 examples, non-final
bt eval --sample 20 --sample-seed 7 qa.eval.ts  # Reproducible random sample
bt eval qa.eval.ts                     # Full dataset, final
When --first or --sample is used, the experiment summary is labeled as non-final in Braintrust. Omitting both flags runs the full dataset and marks the summary as final.
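Why a fixed seed makes --sample reproducible can be illustrated with a short sketch of seeded sampling in general. This is a conceptual illustration only, not Braintrust's actual sampling code:

```python
import random

def sample_examples(examples, n, seed=0):
    # Same seed + same dataset -> same subset, which is what makes
    # `--sample N --sample-seed S` repeatable across runs.
    return random.Random(seed).sample(examples, n)

data = list(range(100))              # stand-in for an eval dataset
run_a = sample_examples(data, 20, seed=7)
run_b = sample_examples(data, 20, seed=7)
assert run_a == run_b                # identical subsets for the same seed
```

Without a seed argument, each invocation would draw a different subset; pinning the seed turns a smoke run into something you can re-run and compare.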

Flags

  • --runner <RUNNER> (env: BT_EVAL_RUNNER): Runner binary (tsx, bun, ts-node, python, etc.)
  • --language <LANG> (env: BT_EVAL_LANGUAGE): Force language: js or py
  • --filter <PATTERN> (env: BT_EVAL_FILTER): Run only evaluators matching the pattern
  • --first <N> (env: BT_EVAL_FIRST): Run only the first N examples (non-final smoke run)
  • --sample <N> (env: BT_EVAL_SAMPLE): Run a deterministic random sample of N examples (non-final smoke run)
  • --sample-seed <S> (env: BT_EVAL_SAMPLE_SEED): Integer seed for --sample (default: 0)
  • --param <KEY=VALUE> (env: BT_EVAL_PARAMS_JSON): Pass a named parameter into evaluators that declare a parameters schema (repeatable; also accepts a JSON object string)
  • --watch / -w (env: BT_EVAL_WATCH): Re-run when input files change
  • --no-send-logs (env: BT_EVAL_LOCAL): Run without sending results to Braintrust
  • --num-workers <N>: Worker threads for Python execution
  • --verbose: Show full errors and stderr from eval files
  • --list: List evaluators without running them
  • --jsonl: Output one JSON summary per evaluator (for scripts). See also the global --json flag (overview), which formats all CLI output as JSON rather than per-evaluator summaries.
  • --terminate-on-failure: Stop after the first failing evaluator
  • --dev: Start a local web server for browser-based eval development (default port: 8300)

Summary output

When using --jsonl or reading SSE output, each evaluator summary object includes these fields:
  • runMode ("full" | "first" | "sample"): How the eval was run
  • isFinal (boolean): Whether this is a final (full-dataset) run
  • runLabel (string): Human-readable description of the run mode
  • sampleCount (number): Number of examples sampled (only present when --first or --sample is used)
  • sampleSeed (number): Seed used for random sampling (only present when --sample is used)
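A script consuming --jsonl output would parse one JSON object per line and read these fields. The summary line below is illustrative; the field values are hypothetical, not actual bt output:

```python
import json

# One illustrative JSONL summary line (values are hypothetical).
line = ('{"runMode": "sample", "isFinal": false, '
        '"runLabel": "random sample of 20", '
        '"sampleCount": 20, "sampleSeed": 7}')

summary = json.loads(line)
if not summary["isFinal"]:
    # Non-final smoke runs (--first / --sample) report how they sampled.
    print(summary["runMode"], summary.get("sampleCount"))
```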

Parameters

Pass runtime parameter values into evaluators that declare a parameters schema using --param. Each evaluator only receives the keys it declares. Extra keys are silently filtered, so a single command can target multiple evaluators with different schemas without errors.
bt eval --param model=gpt-4o --param count=5 my.eval.ts
bt eval --param '{"model":"gpt-4o","count":5}' my.eval.ts
Parameters are validated against the evaluator’s declared schema before execution. Evaluators without a parameters schema are unaffected.
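The key-filtering behavior described above can be sketched as follows. This is a conceptual illustration of how one command can serve evaluators with different schemas, not the CLI's actual code:

```python
def filter_params(passed: dict, declared: set) -> dict:
    # Each evaluator receives only the keys it declares; extra keys
    # are dropped silently rather than raising an error.
    return {k: v for k, v in passed.items() if k in declared}

# One set of --param values, two hypothetical evaluator schemas:
passed = {"model": "gpt-4o", "count": 5, "temperature": 0.2}
qa_eval = filter_params(passed, {"model", "count"})
summarizer = filter_params(passed, {"model", "temperature"})
```

Each evaluator then validates only the keys it actually received against its own schema.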

Passing arguments to the eval file

Use -- to forward extra arguments to the eval file via process.argv:
bt eval foo.eval.ts -- --description "Prod" --shard 1/4
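The source describes process.argv for JavaScript eval files; a Python eval file would presumably read forwarded arguments from sys.argv instead. A minimal sketch, parsing an example argv list directly (the flag names --description and --shard come from the command above but are otherwise arbitrary):

```python
import argparse

# Arguments after `--` are forwarded to the eval file's own argv.
parser = argparse.ArgumentParser()
parser.add_argument("--description", default="")
parser.add_argument("--shard", default="1/1")

forwarded = ["--description", "Prod", "--shard", "1/4"]  # as in the command above
args, _unknown = parser.parse_known_args(forwarded)
```

Using parse_known_args keeps the eval file tolerant of extra flags it does not declare.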

Running in CI

Set BRAINTRUST_API_KEY instead of using OAuth login:
# GitHub Actions example
- name: Run evals
  env:
    BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
  run: bt eval tests/
Use --no-input and --json for non-interactive output:
BRAINTRUST_API_KEY=... bt eval tests/ --no-input --json