Testing & Evaluation

@prompd/test is the testing framework for .prmd prompt files. It uses colocated .test.prmd sidecar files to define test cases with assertions against compiled and executed prompts.

  • Colocated discovery: summarize.prmd automatically discovers summarize.test.prmd in the same directory
  • Three evaluator types ordered by cost: NLP (local/free), Script (custom logic), Prmd (LLM-based)
  • Fail-fast execution: Evaluators run in cost order — if a cheap assertion fails, expensive evaluators are skipped
  • CI-friendly: --no-llm flag runs only compilation and NLP assertions with zero API costs
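The colocated lookup described above can be sketched in a few lines. This is an illustrative helper (the name `sidecarFor` is not part of the package's actual API), assuming only the documented naming convention:

```typescript
import * as path from "node:path";

// Illustrative sketch of colocated discovery: for a given source .prmd,
// the sidecar test file has the same basename plus ".test.prmd" in the
// same directory (summarize.prmd -> summarize.test.prmd).
function sidecarFor(promptPath: string): string {
  const dir = path.dirname(promptPath);
  const base = path.basename(promptPath, ".prmd");
  return path.join(dir, `${base}.test.prmd`);
}
```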

Test files use YAML frontmatter to define a test suite with one or more test cases. The Markdown content block is optional and serves as the default evaluator prompt for prmd assertions.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `name` | string | No | Test suite name. Defaults to the filename. |
| `description` | string | No | Human-readable description of the test suite. |
| `target` | string | No | Relative path to the source `.prmd`. Auto-discovered from the filename if omitted. |
| `max_tokens` | number | No | Default max tokens for LLM execution across all test cases. |
| `tests` | array | Yes | Array of test case definitions. |

Each entry in the tests array supports:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `name` | string | No | Test case name. Defaults to `test_N`. |
| `params` | object | No | Parameters passed to the target `.prmd` for compilation. |
| `assert` | array | No | Array of assertion definitions. |
| `expect_error` | boolean | No | If `true`, the test passes when compilation fails. |

Assertions use one of three evaluator types. They execute in cost order: NLP first, then Script, then Prmd. If any assertion fails, remaining evaluators in the chain are skipped.

The NLP evaluator is local, fast, free, and deterministic. It runs string and token checks against the LLM response without any external calls.

```yaml
- evaluator: nlp
  check: contains
  value: "expected text"
```
| Check | Value Type | Description |
| --- | --- | --- |
| `contains` | string or string[] | Response contains all values (case-insensitive). |
| `not_contains` | string or string[] | Response contains none of the values. |
| `matches` | string | Response matches the given regex pattern. |
| `max_tokens` | number | Estimated token count is at most this value. |
| `min_tokens` | number | Estimated token count is at least this value. |
| `starts_with` | string | Response starts with this value (case-insensitive). |
| `ends_with` | string | Response ends with this value (case-insensitive). |
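A minimal sketch of how the case-insensitive `contains` and `not_contains` checks behave. The function names are illustrative, not the package's internals; this only demonstrates the documented semantics (all values must match for `contains`, none for `not_contains`):

```typescript
// Normalize both sides to lowercase, per the documented case-insensitivity.
function contains(response: string, values: string | string[]): boolean {
  const haystack = response.toLowerCase();
  const needles = Array.isArray(values) ? values : [values];
  return needles.every((v) => haystack.includes(v.toLowerCase()));
}

function notContains(response: string, values: string | string[]): boolean {
  const haystack = response.toLowerCase();
  const needles = Array.isArray(values) ? values : [values];
  return needles.every((v) => !haystack.includes(v.toLowerCase()));
}
```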

The Script evaluator runs a custom script for validation logic that goes beyond string matching.

```yaml
- evaluator: script
  run: ./validators/schema-check.ts
```

The script receives a JSON object on stdin with the following shape:

```typescript
interface ScriptInput {
  prompt: string;   // The compiled prompt sent to the LLM
  response: string; // The LLM's response
  params: object;   // The test case parameters
  metadata: object; // Additional execution metadata
}
```
  • Exit code 0 = PASS
  • Exit code 1 = FAIL
  • stdout = reason string (displayed in test output)
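Putting the contract together, a validator script might look like the sketch below. The validation rule (requiring the response to be valid JSON) and the `validate` helper are hypothetical examples, not part of the framework; only the stdin shape, stdout reason, and exit-code convention come from the documentation above:

```typescript
import { readFileSync } from "node:fs";

interface ScriptInput {
  prompt: string;   // The compiled prompt sent to the LLM
  response: string; // The LLM's response
  params: object;   // The test case parameters
  metadata: object; // Additional execution metadata
}

// Example rule: the response must be valid JSON. Any custom logic works here.
function validate(input: ScriptInput): { pass: boolean; reason: string } {
  try {
    JSON.parse(input.response);
    return { pass: true, reason: "response is valid JSON" };
  } catch {
    return { pass: false, reason: "response is not valid JSON" };
  }
}

// Entry point when invoked as a script evaluator: read ScriptInput from
// stdin, print the reason to stdout, and signal PASS/FAIL via exit code.
if (process.argv[1]?.endsWith("schema-check.ts")) {
  const input: ScriptInput = JSON.parse(readFileSync(0, "utf8"));
  const result = validate(input);
  console.log(result.reason); // stdout becomes the displayed reason string
  process.exit(result.pass ? 0 : 1);
}
```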

The Prmd evaluator uses an LLM to evaluate the response. The evaluator prompt can come from the content block of the `.test.prmd` file, a local `.prmd` file, or a registry package.

```yaml
# Use the content block of this .test.prmd as the evaluator prompt
- evaluator: prmd

# Use a local .prmd file as the evaluator
- evaluator: prmd
  prompt: ./evaluators/my-evaluator.prmd

# Use a registry package as the evaluator
- evaluator: prmd
  prompt: "@prompd/eval-coherence@^1.0.0"

# Override provider and model for this evaluator
- evaluator: prmd
  provider: anthropic
  model: claude-sonnet-4-20250514
```

The following variables are available inside evaluator prompts:

| Variable | Description |
| --- | --- |
| `{{ prompt }}` | The compiled prompt that was sent to the LLM. |
| `{{ response }}` | The LLM's response being evaluated. |
| `{{ params }}` | The full test case parameters object (JSON). |
| `{{ params.key }}` | Individual parameter value via dot notation. |

The `params` parameter type must be `object` (not `string`) to enable dot-notation access like `{{ params.name }}`.

The evaluator prompt must respond with PASS or FAIL as the first word, followed by a reason.
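The verdict contract (first word `PASS` or `FAIL`, rest is the reason) can be parsed as sketched below. The helper name `parseVerdict` is illustrative, not the framework's actual internals:

```typescript
// Split on whitespace: first token is the verdict, the rest is the reason.
function parseVerdict(text: string): { pass: boolean; reason: string } {
  const [first, ...rest] = text.trim().split(/\s+/);
  const word = first.toUpperCase();
  if (word !== "PASS" && word !== "FAIL") {
    throw new Error(`Unrecognized verdict: ${first}`);
  }
  return { pass: word === "PASS", reason: rest.join(" ") };
}
```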

Evaluators always run in cost order regardless of how they appear in the assert array:

  1. NLP — local string/token checks (free)
  2. Script — custom validation scripts (free, but has process overhead)
  3. Prmd — LLM-based evaluation (costs API tokens)

If any assertion at a given tier fails, all remaining evaluators are skipped (fail-fast).
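The cost-ordered, fail-fast scheme above can be sketched as follows. Type and function names are illustrative assumptions, not the package's actual implementation:

```typescript
type Evaluator = "nlp" | "script" | "prmd";

interface Assertion {
  evaluator: Evaluator;
}

// Tier costs: nlp is free, script has process overhead, prmd spends tokens.
const COST_ORDER: Record<Evaluator, number> = { nlp: 0, script: 1, prmd: 2 };

// Stable sort by tier, regardless of order in the assert array.
function orderAssertions(asserts: Assertion[]): Assertion[] {
  return [...asserts].sort(
    (a, b) => COST_ORDER[a.evaluator] - COST_ORDER[b.evaluator],
  );
}

// Fail fast: the first failing assertion skips all remaining (costlier) ones.
function runAll(
  asserts: Assertion[],
  run: (a: Assertion) => boolean,
): boolean {
  for (const a of orderAssertions(asserts)) {
    if (!run(a)) return false;
  }
  return true;
}
```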

The --no-llm flag restricts execution to compilation and NLP assertions only. Prmd evaluators are skipped entirely. This is useful for CI pipelines where you want fast, deterministic checks without API costs.

When executing test cases, the provider and model are resolved in priority order:

  1. Source .prmd frontmatter (provider / model fields)
  2. Test run options (UI selector or CLI flags)
  3. User config defaults (~/.prompd/config.yaml)
  4. Built-in fallback
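The four-step resolution can be expressed as a chain of nullish fallbacks, sketched below under the assumption that each source exposes optional `provider`/`model` fields (the names `ModelChoice` and `resolveModel` are illustrative):

```typescript
interface ModelChoice {
  provider?: string;
  model?: string;
}

// Each field is resolved independently in documented priority order:
// frontmatter -> run options -> user config -> built-in fallback.
function resolveModel(
  frontmatter: ModelChoice,
  runOptions: ModelChoice,
  userConfig: ModelChoice,
  fallback: Required<ModelChoice>,
): Required<ModelChoice> {
  return {
    provider:
      frontmatter.provider ?? runOptions.provider ??
      userConfig.provider ?? fallback.provider,
    model:
      frontmatter.model ?? runOptions.model ??
      userConfig.model ?? fallback.model,
  };
}
```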

Given a prompt file hello.prmd that accepts a name parameter, the following hello.test.prmd defines two test cases with mixed evaluator types:

```markdown
---
id: hello-test
name: "hello.test"
version: 0.0.1
description: "Tests for Hello World prompt"
target: ./hello.prmd
max_tokens: 64
tests:
  - name: "greets Alice"
    params:
      name: "Alice"
    assert:
      - evaluator: nlp
        check: contains
        value: "Alice"
      - evaluator: nlp
        check: min_tokens
        value: 5
  - name: "output is friendly"
    params:
      name: "Bob"
    assert:
      - evaluator: nlp
        check: not_contains
        value: ["error", "fail", "sorry"]
      - evaluator: prmd
---

# Evaluator

You are evaluating a greeting message.

- **Name provided:** {{ params.name }}
- Addresses the person by name: "{{ params.name }}"
- Has a friendly, warm tone
- Is 2-3 sentences

**Prompt:** {{ prompt }}

**Response:** {{ response }}

Respond with PASS or FAIL followed by a one-sentence reason.
```

The first test case (“greets Alice”) uses only NLP assertions — it checks that the response contains “Alice” and is at least 5 tokens. No LLM evaluation is needed.

The second test case (“output is friendly”) combines an NLP assertion with a Prmd evaluator. The NLP check runs first to verify no error-like words appear. If that passes, the Prmd evaluator uses the content block (the # Evaluator section) as its prompt, with {{ params.name }}, {{ prompt }}, and {{ response }} substituted at evaluation time.

```
@prompd/test
  TestRunner         # Orchestrates test execution
  TestParser         # Parses .test.prmd files
  TestDiscovery      # Finds colocated test files
  EvaluatorEngine    # Routes assertions to evaluators
  evaluators/
    NlpEvaluator     # String and token checks
    ScriptEvaluator  # External script runner
    PrmdEvaluator    # LLM-based evaluation
  reporters/
    ConsoleReporter  # Terminal output
    JsonReporter     # Machine-readable JSON
    JunitReporter    # JUnit XML for CI integration
```

Peer dependency: `@prompd/cli`

The TestHarness interface in @prompd/cli allows test framework registration as a plugin:

```typescript
cli.registerTestHarness(harness);
```

```sh
# Run a specific test file
prompd test ./hello.test.prmd

# Run tests without LLM evaluation (NLP assertions only)
prompd test ./hello.test.prmd --no-llm
```