Tests

Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.

tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"

| Field | Required | Description |
| --- | --- | --- |
| id | Yes | Unique identifier for the test |
| criteria | Yes | Description of what a correct response should contain |
| input | Yes | Input sent to the target (string, object, or message array) |
| expected_output | No | Expected response for comparison (string, object, or message array) |
| execution | No | Per-case execution overrides (for example target, skip_defaults) |
| workspace | No | Per-case workspace config (overrides suite-level) |
| metadata | No | Arbitrary key-value pairs passed to evaluators and workspace scripts |
| rubrics | No | Structured evaluation criteria |
| assertions | No | Per-test evaluators |

The simplest form is a string, which expands to a single user message:

input: What is 15 + 27?

For multi-turn or system messages, use a message array:

input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?

When suite-level input is defined in the eval file, those messages are prepended to the test’s input. See Suite-level Input.
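
As a minimal sketch of the prepending behavior (assuming the suite-level field is also named input, as described in Suite-level Input):

```yaml
# Suite-level input: prepended to every test's input
input:
  - role: system
    content: You are a helpful math tutor.
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    # Effective input: the system message above, then this user message
```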

expected_output is an optional reference response used by evaluators for comparison. A string expands to a single assistant message:

expected_output: "42"

For structured or multi-message expected output, use a message array:

expected_output:
  - role: assistant
    content: "42"

Override the default target or evaluators for specific tests:

tests:
  - id: complex-case
    criteria: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
    assertions:
      - name: depth_check
        type: llm-grader
        prompt: ./graders/depth.md

Per-case assertions evaluators are merged with root-level assertions evaluators — test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set execution.skip_defaults: true:

assertions:
  - name: latency_check
    type: latency
    threshold: 5000
tests:
  - id: normal-case
    criteria: Returns correct answer
    input: What is 2+2?
    # Gets latency_check from root-level assertions
  - id: special-case
    criteria: Handles edge case
    input: Handle this edge case
    execution:
      skip_defaults: true
    assertions:
      - name: custom_eval
        type: llm-grader
    # Does NOT get latency_check

Override the suite-level workspace config for individual tests. Test-level fields replace suite-level fields:

workspace:
  hooks:
    before_all:
      command: ["bun", "run", "default-setup.ts"]
tests:
  - id: case-1
    criteria: Should work
    input: Do something
    workspace:
      hooks:
        before_all:
          command: ["bun", "run", "custom-setup.ts"]
  - id: case-2
    criteria: Should also work
    input: Do something else
    # Inherits suite-level hooks.before_all

See Workspace Lifecycle Hooks for the full workspace config reference.
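The replacement rule above can be sketched in a few lines of Python. This assumes a shallow merge, i.e. each top-level workspace field from the test replaces the suite-level field wholesale, which matches "test-level fields replace suite-level fields" but is not confirmed at deeper nesting levels:

```python
def merge_workspace(suite: dict, test: dict) -> dict:
    # Test-level fields replace suite-level fields wholesale (no deep merge):
    # if a test defines hooks, the suite's hooks are not consulted at all.
    merged = dict(suite)
    merged.update(test)
    return merged

# case-2 defines no workspace, so it inherits the suite config unchanged.
inherited = merge_workspace({"hooks": {"before_all": "default-setup"}}, {})
# case-1 defines its own hooks, which replace the suite's hooks entirely.
overridden = merge_workspace(
    {"hooks": {"before_all": "default-setup"}},
    {"hooks": {"before_all": "custom-setup"}},
)
```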

Pass arbitrary key-value pairs to lifecycle commands via the metadata field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:

tests:
  - id: sympy-20590
    criteria: Bug should be fixed
    input: Fix the diophantine equation bug
    metadata:
      repo: sympy/sympy
      base_commit: "abc123def"
workspace:
  hooks:
    before_all:
      command: ["python", "checkout_repo.py"]

The metadata field is included in the stdin JSON passed to lifecycle commands as case_metadata.
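As a sketch, a lifecycle command could pick the metadata out of that payload like so. The case_metadata key comes from the docs; the overall payload shape and the sample values here are illustrative assumptions:

```python
import json

# A lifecycle command receives a JSON document on stdin. The test's
# `metadata` mapping appears under the "case_metadata" key; a real
# command would use json.load(sys.stdin) instead of this sample string.
sample_stdin = json.dumps(
    {"case_metadata": {"repo": "sympy/sympy", "base_commit": "abc123def"}}
)

payload = json.loads(sample_stdin)
meta = payload.get("case_metadata", {})
print(meta["repo"], meta["base_commit"])
```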

The assertions field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.

These evaluators run without an LLM call and produce binary (0 or 1) scores:

| Type | Value | Description |
| --- | --- | --- |
| contains | string | Pass if output includes the substring |
| contains-any | string[] | Pass if output includes ANY of the strings |
| contains-all | string[] | Pass if output includes ALL of the strings |
| icontains | string | Case-insensitive contains |
| icontains-any | string[] | Case-insensitive contains-any |
| icontains-all | string[] | Case-insensitive contains-all |
| starts-with | string | Pass if output starts with value (trimmed) |
| ends-with | string | Pass if output ends with value (trimmed) |
| regex | string | Pass if output matches regex (optional flags: "i") |
| is-json | — | Pass if output is valid JSON |
| equals | string | Pass if output exactly equals the value (trimmed) |

Underscore variants (contains_all, is_json, etc.) are also accepted.

tests:
  - id: json-api
    criteria: Returns valid JSON with status field
    input: Return the system status as JSON
    assertions:
      - type: is-json
      - type: contains
        value: '"status"'

Use contains-all or contains-any to check multiple values in a single assertion instead of repeating contains multiple times:

tests:
  - id: required-fields
    criteria: Response mentions all required fields
    input: "Confirm details: name is Alice, email is alice@example.com"
    assertions:
      - type: contains-all
        value: ["Alice", "alice@example.com"]
  - id: greeting-variant
    criteria: Response includes some form of greeting
    input: "Greet the user warmly."
    assertions:
      - type: contains-any
        value: ["Hello", "Hi", "Hey", "Welcome", "Greetings"]

All deterministic assertions support these optional fields:

| Field | Type | Description |
| --- | --- | --- |
| negate | boolean | Invert the result (pass becomes fail, fail becomes pass) |
| weight | number | Relative weight when aggregating scores (default: 1) |
| required | boolean \| number | Gate that must pass for the overall test to pass. true uses the 0.8 threshold; a number sets a custom threshold. |
| name | string | Custom name for the assertion (auto-generated if omitted) |
| flags | string | Regex flags for the regex type (e.g., "i" for case-insensitive) |

tests:
  - id: no-competitors
    criteria: Response must not mention any competitor
    input: "Describe our product advantages."
    assertions:
      - type: contains-any
        value: ["CompetitorA", "CompetitorB", "CompetitorC"]
        negate: true
  - id: required-inputs
    criteria: Agent asks for missing rule codes
    input: "Process customs entry for country BE."
    assertions:
      - name: asks-for-rule-codes
        type: icontains-any
        value: ["rule code", "rule codes"]
        required: true
      - name: mentions-format
        type: icontains-any
        value: ["true/false", "boolean", "expected value"]

Assertion evaluators auto-generate a name when one is not provided (e.g., contains-DENIED, is_json).

Use type: rubrics with a criteria array to define structured LLM-graded evaluation criteria inline:

tests:
  - id: denied-party
    criteria: Must identify denied party
    input:
      - role: user
        content: Screen "Acme Corp" against denied parties list
    expected_output:
      - role: assistant
        content: "DENIED"
    assertions:
      - type: contains
        value: "DENIED"
        required: true
      - type: rubrics
        criteria:
          - id: accuracy
            outcome: Correctly identifies the denied party
            weight: 5.0
          - id: reasoning
            outcome: Provides clear reasoning for the decision
            weight: 3.0

Any evaluator in assertions can be marked as required. When a required evaluator fails, the overall test verdict is fail regardless of the aggregate score.

| Value | Behavior |
| --- | --- |
| required: true | Must score >= 0.8 (default threshold) to pass |
| required: 0.6 | Must score >= 0.6 to pass (custom threshold between 0 and 1) |

assertions:
  - type: contains
    value: "DENIED"
    required: true # must pass (>= 0.8)
  - type: rubrics
    required: 0.6 # must score at least 0.6
    criteria:
      - id: quality
        outcome: Response is well-structured
        weight: 1.0

Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to fail.
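
A rough Python sketch of this gating logic. Only the per-gate thresholds are stated in the docs; the weighted-average aggregate and the simplified overall pass rule here are assumptions for illustration:

```python
def verdict(results):
    """results: list of dicts with "score" (0..1) and optional "weight", "required"."""
    total_weight = sum(r.get("weight", 1) for r in results)
    aggregate = sum(r["score"] * r.get("weight", 1) for r in results) / total_weight

    # Gates are checked after all evaluators run; any required evaluator
    # below its threshold forces the verdict to fail, whatever the aggregate.
    for r in results:
        required = r.get("required")
        if required is None:
            continue
        threshold = 0.8 if required is True else float(required)
        if r["score"] < threshold:
            return aggregate, "fail"
    return aggregate, "pass"
```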

assertions can be defined at both suite and test levels:

  • Per-test assertions evaluators run first.
  • Suite-level assertions evaluators are appended automatically.
  • Set execution.skip_defaults: true on a test to skip suite-level defaults.

The criteria field is a data field that describes what the response should accomplish. It is not an evaluator itself — how it gets used depends on whether assertions is present.

When a test has no assertions field, a default llm-grader evaluator runs automatically and uses criteria as the evaluation prompt:

tests:
  - id: simple-eval
    criteria: Assistant correctly explains the bug and proposes a fix
    input: "Debug this function..."
    # No assertions → default llm-grader evaluates against criteria

assertions present — explicit evaluators only


When assertions is defined, only the declared evaluators run. No implicit grader is added. Graders that are declared (such as llm-grader, code-grader, or rubrics) receive criteria as input automatically.

If assertions contains only deterministic evaluators (like contains or regex), the criteria field is not evaluated and a warning is emitted:

Warning: Test 'my-test': criteria is defined but no evaluator in assertions
will evaluate it. Add 'type: llm-grader' to assertions, or remove criteria
if it is documentation-only.

To use criteria alongside deterministic checks, add a grader explicitly:

tests:
  - id: mixed-eval
    criteria: Response is helpful and mentions the fix
    input: "Debug this function..."
    assertions:
      - type: llm-grader # explicit — receives criteria automatically
      - type: contains
        value: "fix"

Pass additional context to evaluators via the metadata field:

tests:
  - id: code-gen
    criteria: Generates valid Python
    metadata:
      language: python
      difficulty: medium
    input: Write a function to sort a list