Documentation · zos-evals

zos-evals — eval harness.

The eval harness, derived from the source of zos_evals 0.1.0. Working code, June 2026.

zos‑evals answers one question honestly: did a change to your AI system actually make it better? You give it a fixed set of tasks, replay every task under two named configurations of the same system, and let a pluggable judge grade and compare the outputs — with the classic statistical traps (judge position bias, single-sample noise, false zeros, believing a sweep) engineered out rather than warned about. Python 3.11+ · PyYAML only · BUSL-1.1

Status: working code · 75 tests · CI green · BUSL-1.1 · repository private until public launch.

Package zos_evals 0.1.0 (src layout). Everything documented here is importable from the top level (from zos_evals import ...) unless noted. The harness never assumes a specific AI vendor — a configuration is just a name bound to an execution recipe.

Key concepts

Corpus — zos_evals.corpus

Tasks live in a JSONL file, one task per line, with a validating loader: invalid input fails loudly at load time, with file:line in the error, instead of corrupting a run.

@dataclass(frozen=True) class Task: id: str # unique within a corpus prompt: str # the text replayed under each configuration task_class: str # mirrors the JSON key "class"; drives per-class scoring expected_qualities: tuple[str, ...] # the rubric the judge grades against metadata: dict # all unknown JSON keys, preserved verbatim
parse_task(obj, *, where="task") -> Task load_corpus(path) -> list[Task]

Executors — zos_evals.executor

An executor is how the harness runs your system. The protocol is one method; implement it to plug in anything — an HTTP call, an SDK, a queue.

class Executor(Protocol): def execute(self, config: str, task: Task) -> ExecutionResult: ... @dataclass(frozen=True) class ExecutionResult: output: str latency_s: float cost: float error: str | None

An error with usable output is kept; an error with no output leads to exclusion downstream. Two implementations ship:

SubprocessExecutor(commands, *, timeout=600.0)

CallableExecutor(fn)

Wraps fn(config: str, task: Task) -> ExecutionResult | str. A plain string return is wrapped into an ExecutionResult with measured latency. An exception from fn becomes ExecutionResult(output="", error=...).

Replay — zos_evals.replay

@dataclass(frozen=True) class Sample: task_id: str; task_class: str; config: str; index: int output: str; latency_s: float; cost: float excluded: bool; reason: str | None # to_row() -> the JSON-serializable dict form ({"type": "sample", ...}) ReplayRunner(executor, configs, *, n_samples=1, min_output_chars=1)
evaluate(tasks, executor, configs, *, judge=None, n_samples=1, min_output_chars=1) -> list[dict]

The end-to-end pipeline, returning flat result rows: (1) replay everything via ReplayRunner; (2) if judge is given, rubric-score every non-excluded sample (rubric_score, per_quality); (3) if judge is given and exactly two configurations are compared, run a position-debiased comparison per paired sample index — pairs where either side is excluded are skipped (no fair comparison exists).

Judging — zos_evals.judge

@dataclass(frozen=True) class RubricResult: score: float | None # pass fraction in [0,1]; None when ungradeable — never a false 0 per_quality: tuple[tuple[str, bool], ...] error: str | None class Judge(Protocol): def score(self, task: Task, output: str) -> RubricResult: ... def compare(self, task: Task, output_a: str, output_b: str) -> str: ... # "a"|"b"|"tie"

debiased_compare(judge, task, output_a, output_b) → str

The load-bearing debiasing mechanism. Runs judge.compare twice — (A, B) then (B, A) — maps the second verdict back into A/B terms, and returns the common verdict only when both passes agree; any disagreement is a tie. Unrecognized verdict strings normalize to "tie".

KeywordJudge()

Deterministic, offline. score: a quality passes when its text appears case-insensitively in the output. compare: more passed qualities wins; equal is a tie. Useful for tests, smoke runs, and corpora whose expected_qualities are literal markers.

CommandJudge(command, *, timeout=120.0)

The vendor-neutral LLM-judge bridge. command is split with shlex.split; the rendered judge prompt is written to the child's stdin; the first {...} block on stdout is parsed as JSON.

rubric_fraction(per_quality) -> float | None — helper: passed/total rounded to 4 places; None for an empty rubric.

Scoring & reports — zos_evals.scoring / .report

aggregate(rows) -> dict

Input: the flat row list from evaluate() (or re-read from a results JSONL). Output (JSON-serializable):

{
  "configs": {
    "<name>": {
      "n_samples": int, "n_scored": int, "n_excluded": int,
      "rubric_mean": float | None,        // mean over scored samples only
      "mean_cost": float, "mean_latency_s": float,
      "per_class": {"<class>": {"rubric_mean": float | None, "n_scored": int}},
    },
  },
  "pairwise": None | {
    "config_a": str, "config_b": str,
    "comparisons": int, "tasks_compared": int,
    "wins": {a: int, b: int}, "ties": int, "decided": int,
    "win_rate": {a: float | None, b: float | None},   // of decided tasks
    "task_verdicts": {task_id: config_name | "tie"},  // per-task majority
    "per_class": {"<class>": {a: int, b: int, "tie": int}},
  },
  "clean_sweep": str | None,               // the sweeping config, if any
  "exclusions": [{"task_id", "config", "sample", "reason"}, ...],
}
clean_sweep_winner(task_verdicts, *, min_decided=2) -> str | None

Returns the configuration that won all decided per-task verdicts, when at least min_decided verdicts were decided. A single decided task is not evidence of a sweep. CLEAN_SWEEP_WARNING is the canonical warning template (.format(winner=..., decided=...)); the lesson it encodes: a clean sweep is a judge-calibration risk, NOT a verdict — validate the judge (human-label a subset, swap rubrics, vary the judge model) before drawing conclusions.

render_markdown(agg, *, title="zos-evals report") -> str write_report(agg, path, *, title=...) -> Path

render_markdown renders an aggregate() result: the clean-sweep warning (when present, at the top, impossible to miss), an overall table (samples/scored/excluded/rubric mean/cost/latency per configuration), per-class rubric means, the pairwise section (counts, win rates, per-task verdicts, per-class breakdown), and an exclusions table. write_report writes it to a file, creating parent directories.

Path resolution (zos_evals.paths.resolve_path): explicit argument wins (tilde-expanded); else $ZOS_HOME/<default_rel> when both are available; else ValueError — the harness never guesses a location.

CLI — zos-evals

Installed as zos-evals (also python -m zos_evals). Exit codes: 0 success, 1 operational error, 2 usage error.

zos-evals validate [CORPUS]          # load + validate; prints task and class counts
zos-evals run [flags]                # the full pipeline
zos-evals report RESULTS [--out PATH] [--title T]   # re-aggregate a results JSONL
Flag (zos-evals run)Meaning
--corpus PATHcorpus (default $ZOS_HOME/evals/corpus.jsonl)
--config NAME=COMMANDrepeatable; command template per configuration
--samples Nsamples per (task, configuration); default 1
--min-output-chars Ntruncation threshold; shorter outputs are excluded
--judge {keyword,command,none}default keyword
--judge-command CMDexternal judge (required with --judge command)
--timeout Sper-execution timeout (default 600)
--out PATHresults JSONL (default: rows to stdout)
--report PATHalso write the markdown report

A summary (per-config scores, pairwise tallies, and the clean-sweep warning when triggered) prints to stderr.

Result row schema

One JSON object per line in a results file. Sample rows:

{"type": "sample", "task_id": "dec-01", "task_class": "decision",
 "config": "baseline", "index": 0, "output": "...",
 "latency_s": 2.1, "cost": 0.012, "excluded": false, "reason": null,
 "rubric_score": 0.5, "per_quality": {"recommendation": true, "failure mode": false}}

Comparison rows (two-configuration runs with a judge):

{"type": "comparison", "task_id": "dec-01", "task_class": "decision",
 "sample": 0, "config_a": "baseline", "config_b": "candidate",
 "winner": "candidate"}

winner is a configuration name or "tie", already position-debiased.

This page mirrors docs/API.md in the zos-evals repository, derived from the source at 0.1.0. Companion: platform overview · zos-core library API. Questions? Request early access.