zos-evals — eval harness.
zos‑evals answers one question honestly: did a change to your AI system actually make it better? You give it a fixed set of tasks, replay every task under two named configurations of the same system, and let a pluggable judge grade and compare the outputs — with the classic statistical traps (judge position bias, single-sample noise, false zeros, believing a sweep) engineered out rather than warned about. Python 3.11+ · PyYAML only · BUSL-1.1
Status: working code · 75 tests · CI green · BUSL-1.1 · repository private until public launch.
Package zos_evals 0.1.0 (src layout). Everything documented here is importable from the top level (from zos_evals import ...) unless noted. The harness never assumes a specific AI vendor — a configuration is just a name bound to an execution recipe.
Key concepts
- Configuration — a named variant of the system under test ("config A vs config B"): same model, different setup (system prompts, hooks, context assembly, tool policies).
- Sample — one replay of one task under one configuration. Replays are stochastic, so each (task, configuration) pair is sampled
n_samplestimes. - Exclusion — a sample whose output is truncated, empty, or errored carries no score (
None). It is missing data, never a 0. - Debiased comparison — every pairwise judgment runs in both orders; only a verdict that survives the position swap counts. Disagreement = tie.
- Clean sweep — one configuration winning every decided per-task verdict. Flagged as judge-calibration risk; treated as directional, never a verdict.
Corpus — zos_evals.corpus
Tasks live in a JSONL file, one task per line, with a validating loader: invalid input fails loudly at load time, with file:line in the error, instead of corrupting a run.
parse_taskvalidates one decoded JSON object; raisesCorpusError(message prefixed withwhere) when the object is not a dict,id/prompt/classis missing, non-string or blank, orexpected_qualitiesis not a list of non-empty strings.load_corpusreads a JSONL file, one task per line, skipping blank lines.CorpusErrorcarriesfile:linefor invalid JSON, failed validation, duplicate ids, a missing file, or an empty corpus.CorpusError(ValueError)is the single exception class for all corpus validation failures.
Executors — zos_evals.executor
An executor is how the harness runs your system. The protocol is one method; implement it to plug in anything — an HTTP call, an SDK, a queue.
An error with usable output is kept; an error with no output leads to exclusion downstream. Two implementations ship:
SubprocessExecutor(commands, *, timeout=600.0)
commands:Mapping[str, str]— configuration name → command template.- The template is split with
shlex.splitonce, then placeholders are substituted per argument (no shell, no quoting hazards):{prompt},{task_id},{class},{config}. The same values are exported to the child asZOS_PROMPT,ZOS_TASK_ID,ZOS_TASK_CLASS,ZOS_CONFIG. - stdout is the captured output. Any stdout line starting with
ZOS_META:is parsed as JSON ({"cost": float, "latency_s": float}), summed into the result, and stripped from the output; malformed meta lines are kept as output. - Failures degrade to results, not exceptions: timeout →
error="timeout after Ns", spawn failure →error="spawn failed: ...", nonzero exit →error="exit N: <stderr excerpt>". execute()with an unknown configuration name raisesKeyError; constructing with no configurations or an empty command raisesValueError.
CallableExecutor(fn)
Wraps fn(config: str, task: Task) -> ExecutionResult | str. A plain string return is wrapped into an ExecutionResult with measured latency. An exception from fn becomes ExecutionResult(output="", error=...).
Replay — zos_evals.replay
- Setup is validated: ≥ 1 configuration, unique names,
n_samples >= 1. run(tasks) -> list[Sample]— every task × configuration × sample index, in deterministic order.- Exclusion rules, applied per sample: an execution error with no usable output → excluded (
execution error: ...); stripped output shorter thanmin_output_chars→ excluded as truncated.
The end-to-end pipeline, returning flat result rows: (1) replay everything via ReplayRunner; (2) if judge is given, rubric-score every non-excluded sample (rubric_score, per_quality); (3) if judge is given and exactly two configurations are compared, run a position-debiased comparison per paired sample index — pairs where either side is excluded are skipped (no fair comparison exists).
Judging — zos_evals.judge
debiased_compare(judge, task, output_a, output_b) → str
The load-bearing debiasing mechanism. Runs judge.compare twice — (A, B) then (B, A) — maps the second verdict back into A/B terms, and returns the common verdict only when both passes agree; any disagreement is a tie. Unrecognized verdict strings normalize to "tie".
KeywordJudge()
Deterministic, offline. score: a quality passes when its text appears case-insensitively in the output. compare: more passed qualities wins; equal is a tie. Useful for tests, smoke runs, and corpora whose expected_qualities are literal markers.
CommandJudge(command, *, timeout=120.0)
The vendor-neutral LLM-judge bridge. command is split with shlex.split; the rendered judge prompt is written to the child's stdin; the first {...} block on stdout is parsed as JSON.
scoreexpects{"per_quality": [{"quality": "...", "pass": true}, ...]}; qualities the judge omits are gradedFalse; a task withoutexpected_qualitiesscoresNone.compareexpects{"winner": "a"|"b"|"tie"}.- Every failure (spawn, timeout, unparseable output) degrades safely:
score→RubricResult(score=None, error=...),compare→"tie". A broken judge must never manufacture a winner. - Run the judge command with a clean configuration (none of the customization under test) so it stays a neutral evaluator.
rubric_fraction(per_quality) -> float | None — helper: passed/total rounded to 4 places; None for an empty rubric.
Scoring & reports — zos_evals.scoring / .report
Input: the flat row list from evaluate() (or re-read from a results JSONL). Output (JSON-serializable):
{
"configs": {
"<name>": {
"n_samples": int, "n_scored": int, "n_excluded": int,
"rubric_mean": float | None, // mean over scored samples only
"mean_cost": float, "mean_latency_s": float,
"per_class": {"<class>": {"rubric_mean": float | None, "n_scored": int}},
},
},
"pairwise": None | {
"config_a": str, "config_b": str,
"comparisons": int, "tasks_compared": int,
"wins": {a: int, b: int}, "ties": int, "decided": int,
"win_rate": {a: float | None, b: float | None}, // of decided tasks
"task_verdicts": {task_id: config_name | "tie"}, // per-task majority
"per_class": {"<class>": {a: int, b: int, "tie": int}},
},
"clean_sweep": str | None, // the sweeping config, if any
"exclusions": [{"task_id", "config", "sample", "reason"}, ...],
}
- Excluded samples never contribute to any mean (no false zeros).
- Per-task verdict = majority of that task's sample comparisons (equal = tie).
win_ratedenominators are decided tasks only (Nonewhen none decided).
Returns the configuration that won all decided per-task verdicts, when at least min_decided verdicts were decided. A single decided task is not evidence of a sweep. CLEAN_SWEEP_WARNING is the canonical warning template (.format(winner=..., decided=...)); the lesson it encodes: a clean sweep is a judge-calibration risk, NOT a verdict — validate the judge (human-label a subset, swap rubrics, vary the judge model) before drawing conclusions.
render_markdown renders an aggregate() result: the clean-sweep warning (when present, at the top, impossible to miss), an overall table (samples/scored/excluded/rubric mean/cost/latency per configuration), per-class rubric means, the pairwise section (counts, win rates, per-task verdicts, per-class breakdown), and an exclusions table. write_report writes it to a file, creating parent directories.
Path resolution (zos_evals.paths.resolve_path): explicit argument wins (tilde-expanded); else $ZOS_HOME/<default_rel> when both are available; else ValueError — the harness never guesses a location.
CLI — zos-evals
Installed as zos-evals (also python -m zos_evals). Exit codes: 0 success, 1 operational error, 2 usage error.
zos-evals validate [CORPUS] # load + validate; prints task and class counts
zos-evals run [flags] # the full pipeline
zos-evals report RESULTS [--out PATH] [--title T] # re-aggregate a results JSONL
Flag (zos-evals run) | Meaning |
|---|---|
--corpus PATH | corpus (default $ZOS_HOME/evals/corpus.jsonl) |
--config NAME=COMMAND | repeatable; command template per configuration |
--samples N | samples per (task, configuration); default 1 |
--min-output-chars N | truncation threshold; shorter outputs are excluded |
--judge {keyword,command,none} | default keyword |
--judge-command CMD | external judge (required with --judge command) |
--timeout S | per-execution timeout (default 600) |
--out PATH | results JSONL (default: rows to stdout) |
--report PATH | also write the markdown report |
A summary (per-config scores, pairwise tallies, and the clean-sweep warning when triggered) prints to stderr.
Result row schema
One JSON object per line in a results file. Sample rows:
{"type": "sample", "task_id": "dec-01", "task_class": "decision",
"config": "baseline", "index": 0, "output": "...",
"latency_s": 2.1, "cost": 0.012, "excluded": false, "reason": null,
"rubric_score": 0.5, "per_quality": {"recommendation": true, "failure mode": false}}
Comparison rows (two-configuration runs with a judge):
{"type": "comparison", "task_id": "dec-01", "task_class": "decision",
"sample": 0, "config_a": "baseline", "config_b": "candidate",
"winner": "candidate"}
winner is a configuration name or "tie", already position-debiased.