Documentation · zos-evals

zos-evals — eval harness.

The eval harness, derived from the source of zos_evals 0.1.0. Working code, June 2026.

zos‑evals answers one question honestly: did a change to your AI system actually make it better? You give it a fixed set of tasks, replay every task under two named configurations of the same system, and let a pluggable judge grade and compare the outputs — with the classic statistical traps (judge position bias, single-sample noise, false zeros, believing a sweep) engineered out rather than warned about. Python 3.11+ · PyYAML only · BUSL-1.1

Status: working code · 75 tests · CI green · BUSL-1.1 · repository private until public launch.

Package zos_evals 0.1.0 (src layout). Everything documented here is importable from the top level (from zos_evals import ...) unless noted. The harness never assumes a specific AI vendor — a configuration is just a name bound to an execution recipe.

Key concepts

Configuration — a named variant of the system under test ("config A vs config B"): same model, different setup (system prompts, hooks, context assembly, tool policies).
Sample — one replay of one task under one configuration. Replays are stochastic, so each (task, configuration) pair is sampled n_samples times.
Exclusion — a sample whose output is truncated, empty, or errored carries no score (None). It is missing data, never a 0.
Debiased comparison — every pairwise judgment runs in both orders; only a verdict that survives the position swap counts. Disagreement = tie.
Clean sweep — one configuration winning every decided per-task verdict. Flagged as judge-calibration risk; treated as directional, never a verdict.

Corpus — `zos_evals.corpus`

Tasks live in a JSONL file, one task per line, with a validating loader: invalid input fails loudly at load time, with file:line in the error, instead of corrupting a run.

@dataclass(frozen=True) class Task: id: str # unique within a corpus prompt: str # the text replayed under each configuration task_class: str # mirrors the JSON key "class"; drives per-class scoring expected_qualities: tuple[str, ...] # the rubric the judge grades against metadata: dict # all unknown JSON keys, preserved verbatim

parse_task(obj, *, where="task") -> Task load_corpus(path) -> list[Task]

parse_task validates one decoded JSON object; raises CorpusError (message prefixed with where) when the object is not a dict, id/prompt/class is missing, non-string or blank, or expected_qualities is not a list of non-empty strings.
load_corpus reads a JSONL file, one task per line, skipping blank lines. CorpusError carries file:line for invalid JSON, failed validation, duplicate ids, a missing file, or an empty corpus.
CorpusError(ValueError) is the single exception class for all corpus validation failures.

Executors — `zos_evals.executor`

An executor is how the harness runs your system. The protocol is one method; implement it to plug in anything — an HTTP call, an SDK, a queue.

class Executor(Protocol): def execute(self, config: str, task: Task) -> ExecutionResult: ... @dataclass(frozen=True) class ExecutionResult: output: str latency_s: float cost: float error: str | None

An error with usable output is kept; an error with no output leads to exclusion downstream. Two implementations ship:

SubprocessExecutor(commands, *, timeout=600.0)

commands: Mapping[str, str] — configuration name → command template.
The template is split with shlex.split once, then placeholders are substituted per argument (no shell, no quoting hazards): {prompt}, {task_id}, {class}, {config}. The same values are exported to the child as ZOS_PROMPT, ZOS_TASK_ID, ZOS_TASK_CLASS, ZOS_CONFIG.
stdout is the captured output. Any stdout line starting with ZOS_META: is parsed as JSON ({"cost": float, "latency_s": float}), summed into the result, and stripped from the output; malformed meta lines are kept as output.
Failures degrade to results, not exceptions: timeout → error="timeout after Ns", spawn failure → error="spawn failed: ...", nonzero exit → error="exit N: <stderr excerpt>".
execute() with an unknown configuration name raises KeyError; constructing with no configurations or an empty command raises ValueError.

CallableExecutor(fn)

Wraps fn(config: str, task: Task) -> ExecutionResult | str. A plain string return is wrapped into an ExecutionResult with measured latency. An exception from fn becomes ExecutionResult(output="", error=...).

Replay — `zos_evals.replay`

@dataclass(frozen=True) class Sample: task_id: str; task_class: str; config: str; index: int output: str; latency_s: float; cost: float excluded: bool; reason: str | None # to_row() -> the JSON-serializable dict form ({"type": "sample", ...}) ReplayRunner(executor, configs, *, n_samples=1, min_output_chars=1)

Setup is validated: ≥ 1 configuration, unique names, n_samples >= 1.
run(tasks) -> list[Sample] — every task × configuration × sample index, in deterministic order.
Exclusion rules, applied per sample: an execution error with no usable output → excluded (execution error: ...); stripped output shorter than min_output_chars → excluded as truncated.

evaluate(tasks, executor, configs, *, judge=None, n_samples=1, min_output_chars=1) -> list[dict]

The end-to-end pipeline, returning flat result rows: (1) replay everything via ReplayRunner; (2) if judge is given, rubric-score every non-excluded sample (rubric_score, per_quality); (3) if judge is given and exactly two configurations are compared, run a position-debiased comparison per paired sample index — pairs where either side is excluded are skipped (no fair comparison exists).

Judging — `zos_evals.judge`

@dataclass(frozen=True) class RubricResult: score: float | None # pass fraction in [0,1]; None when ungradeable — never a false 0 per_quality: tuple[tuple[str, bool], ...] error: str | None class Judge(Protocol): def score(self, task: Task, output: str) -> RubricResult: ... def compare(self, task: Task, output_a: str, output_b: str) -> str: ... # "a"|"b"|"tie"

debiased_compare(judge, task, output_a, output_b) → str

The load-bearing debiasing mechanism. Runs judge.compare twice — (A, B) then (B, A) — maps the second verdict back into A/B terms, and returns the common verdict only when both passes agree; any disagreement is a tie. Unrecognized verdict strings normalize to "tie".

KeywordJudge()

Deterministic, offline. score: a quality passes when its text appears case-insensitively in the output. compare: more passed qualities wins; equal is a tie. Useful for tests, smoke runs, and corpora whose expected_qualities are literal markers.

CommandJudge(command, *, timeout=120.0)

The vendor-neutral LLM-judge bridge. command is split with shlex.split; the rendered judge prompt is written to the child's stdin; the first {...} block on stdout is parsed as JSON.

score expects {"per_quality": [{"quality": "...", "pass": true}, ...]}; qualities the judge omits are graded False; a task without expected_qualities scores None.
compare expects {"winner": "a"|"b"|"tie"}.
Every failure (spawn, timeout, unparseable output) degrades safely: score → RubricResult(score=None, error=...), compare → "tie". A broken judge must never manufacture a winner.
Run the judge command with a clean configuration (none of the customization under test) so it stays a neutral evaluator.

rubric_fraction(per_quality) -> float | None — helper: passed/total rounded to 4 places; None for an empty rubric.

Scoring & reports — `zos_evals.scoring` / `.report`

aggregate(rows) -> dict

Input: the flat row list from evaluate() (or re-read from a results JSONL). Output (JSON-serializable):

{
  "configs": {
    "<name>": {
      "n_samples": int, "n_scored": int, "n_excluded": int,
      "rubric_mean": float | None,        // mean over scored samples only
      "mean_cost": float, "mean_latency_s": float,
      "per_class": {"<class>": {"rubric_mean": float | None, "n_scored": int}},
    },
  },
  "pairwise": None | {
    "config_a": str, "config_b": str,
    "comparisons": int, "tasks_compared": int,
    "wins": {a: int, b: int}, "ties": int, "decided": int,
    "win_rate": {a: float | None, b: float | None},   // of decided tasks
    "task_verdicts": {task_id: config_name | "tie"},  // per-task majority
    "per_class": {"<class>": {a: int, b: int, "tie": int}},
  },
  "clean_sweep": str | None,               // the sweeping config, if any
  "exclusions": [{"task_id", "config", "sample", "reason"}, ...],
}

Excluded samples never contribute to any mean (no false zeros).
Per-task verdict = majority of that task's sample comparisons (equal = tie).
win_rate denominators are decided tasks only (None when none decided).

clean_sweep_winner(task_verdicts, *, min_decided=2) -> str | None

Returns the configuration that won all decided per-task verdicts, when at least min_decided verdicts were decided. A single decided task is not evidence of a sweep. CLEAN_SWEEP_WARNING is the canonical warning template (.format(winner=..., decided=...)); the lesson it encodes: a clean sweep is a judge-calibration risk, NOT a verdict — validate the judge (human-label a subset, swap rubrics, vary the judge model) before drawing conclusions.

render_markdown(agg, *, title="zos-evals report") -> str write_report(agg, path, *, title=...) -> Path

render_markdown renders an aggregate() result: the clean-sweep warning (when present, at the top, impossible to miss), an overall table (samples/scored/excluded/rubric mean/cost/latency per configuration), per-class rubric means, the pairwise section (counts, win rates, per-task verdicts, per-class breakdown), and an exclusions table. write_report writes it to a file, creating parent directories.

Path resolution (zos_evals.paths.resolve_path): explicit argument wins (tilde-expanded); else $ZOS_HOME/<default_rel> when both are available; else ValueError — the harness never guesses a location.

CLI — `zos-evals`

Installed as zos-evals (also python -m zos_evals). Exit codes: 0 success, 1 operational error, 2 usage error.

zos-evals validate [CORPUS]          # load + validate; prints task and class counts
zos-evals run [flags]                # the full pipeline
zos-evals report RESULTS [--out PATH] [--title T]   # re-aggregate a results JSONL

Flag (`zos-evals run`)	Meaning
`--corpus PATH`	corpus (default `$ZOS_HOME/evals/corpus.jsonl`)
`--config NAME=COMMAND`	repeatable; command template per configuration
`--samples N`	samples per (task, configuration); default 1
`--min-output-chars N`	truncation threshold; shorter outputs are excluded
`--judge {keyword,command,none}`	default `keyword`
`--judge-command CMD`	external judge (required with `--judge command`)
`--timeout S`	per-execution timeout (default 600)
`--out PATH`	results JSONL (default: rows to stdout)
`--report PATH`	also write the markdown report

A summary (per-config scores, pairwise tallies, and the clean-sweep warning when triggered) prints to stderr.

Result row schema

One JSON object per line in a results file. Sample rows:

{"type": "sample", "task_id": "dec-01", "task_class": "decision",
 "config": "baseline", "index": 0, "output": "...",
 "latency_s": 2.1, "cost": 0.012, "excluded": false, "reason": null,
 "rubric_score": 0.5, "per_quality": {"recommendation": true, "failure mode": false}}

Comparison rows (two-configuration runs with a judge):

{"type": "comparison", "task_id": "dec-01", "task_class": "decision",
 "sample": 0, "config_a": "baseline", "config_b": "candidate",
 "winner": "candidate"}

winner is a configuration name or "tie", already position-debiased.

This page mirrors docs/API.md in the zos-evals repository, derived from the source at 0.1.0. Companion: platform overview · zos-core library API. Questions? Request early access.