zos-telemetry — measurement layer.
zos-telemetry is the "see what works" layer: it measures how well an AI system is actually serving you (corrections per session, cost, friction) and compares two systems head-to-head. It is content-free by construction: the store holds metrics and short detection markers, and no field can hold message text. Python 3.11+ · stdlib only · BUSL-1.1
Status: working code · 28 tests · CI green · BUSL-1.1 · repository private until public launch.
Package zos_telemetry 0.1.0. Storage resolves from an explicit root argument → $ZOS_HOME/telemetry → ~/.zos/telemetry (telemetry_root() — it does not create the directory; writers do, lazily).
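The resolution order above can be sketched as follows. This is a minimal illustration, not the package's `telemetry_root()` source; the function name `resolve_root` and the `env` parameter are hypothetical and exist only to make the precedence easy to test.

```python
import os
from pathlib import Path


def resolve_root(root=None, env=None):
    """Resolve the storage directory without creating it.

    Precedence: explicit root, then $ZOS_HOME/telemetry, then ~/.zos/telemetry.
    """
    # `env` is a hypothetical injection point; by default the real environment is used.
    env = os.environ if env is None else env
    if root is not None:
        return Path(root).expanduser()
    zos_home = env.get("ZOS_HOME")
    if zos_home:
        return Path(zos_home).expanduser() / "telemetry"
    return Path.home() / ".zos" / "telemetry"
```

Note that nothing here touches the filesystem, which mirrors the stated behavior that the directory is only created lazily by writers.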
Key concepts
- Corrections per session: the headline effectiveness proxy, i.e. how often the user had to correct or redirect the assistant. Measured from USER messages (not the assistant's reply format), so it is identical across the systems being compared. Lower is better.
- Content-free markers: raw text is reduced to short pattern-index tokens (`corr.strong.2`) at capture time and then discarded. Marker length and count are capped so markers cannot smuggle prose. The test suite asserts stored bytes contain no message text.
- Conservative, versioned detection: the detector under-counts by design and every stored row records its `detector_version`; refine patterns over time rather than chase individual false hits.
- Directional, not a verdict: small-sample comparisons are explicitly labeled; a clean sweep over a few sessions is a hint, not a conclusion.
Event capture — zos_telemetry.events
Append-only store under the resolved root, with two month-bucketed sinks:
| Sink | Contents |
|---|---|
| `events-YYYY-MM.jsonl` | one row per event |
| `sessions-YYYY-MM.jsonl` | one summary row per closed session |
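As a rough illustration of month bucketing, the sketch below derives a sink file name from an event timestamp. It assumes the bucket is chosen from the timestamp's UTC month, which the text above does not state explicitly; `sink_name` is a hypothetical helper, not part of the package.

```python
from datetime import datetime, timezone


def sink_name(prefix, ts):
    """Return e.g. 'events-2026-06.jsonl' for an ISO-8601 'Z' timestamp."""
    # Normalise the trailing 'Z' so fromisoformat also accepts it on older Pythons.
    moment = datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
    return f"{prefix}-{moment:%Y-%m}.jsonl"
```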
- `append(*, session_id, system, kind, role=None, ts=None, context=None, model=None, tokens_in=None, tokens_out=None, cost_usd=None, markers=None) -> dict`: append one event. Keyword-only and schema-closed: there is no parameter through which message content can be stored. `ts` defaults to now (UTC, ISO-8601 `Z`). Markers are validated: max 16 per event, max 32 chars each.
- `append_record(record) -> dict`: integration convenience. Rejects content-bearing keys (`content`, `text`, `message`, `body`, `transcript`, `prompt`, `completion`) and any key outside the schema, then routes through `append`.
- `close_session(session_id, ts=None) -> dict | None`: fold the session's events into one summary row, exactly once. Idempotent: re-closing returns the existing row. Returns `None` if the session has no events.
- `read_events(session_id=None)`: all events across every monthly sink (oldest file first), optionally filtered to one session.
- `read_sessions()`: all closed-session summary rows.
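The closed-schema idea can be illustrated with a small standalone validator. This is a sketch under assumptions, not the package's code: the name `validate_record`, the use of `ValueError`, and the order of the checks are all hypothetical; only the key lists and the marker caps (16 per event, 32 characters each) come from the description above.

```python
# Keys that could carry message content; always rejected.
CONTENT_KEYS = {"content", "text", "message", "body", "transcript", "prompt", "completion"}
# The only keys an event may carry (mirrors the append() parameters).
SCHEMA_KEYS = {
    "session_id", "system", "kind", "role", "ts", "context",
    "model", "tokens_in", "tokens_out", "cost_usd", "markers",
}
MAX_MARKERS = 16
MAX_MARKER_LEN = 32


def validate_record(record):
    """Raise ValueError if the record could carry content or breaks the marker caps."""
    keys = set(record)
    banned = keys & CONTENT_KEYS
    if banned:
        raise ValueError(f"content-bearing keys rejected: {sorted(banned)}")
    unknown = keys - SCHEMA_KEYS
    if unknown:
        raise ValueError(f"keys outside the schema: {sorted(unknown)}")
    markers = record.get("markers") or []
    if len(markers) > MAX_MARKERS:
        raise ValueError("too many markers")
    for marker in markers:
        if not isinstance(marker, str) or len(marker) > MAX_MARKER_LEN:
            raise ValueError("each marker must be a short string")
    return record
```

Rejecting unknown keys, rather than silently dropping them, is what keeps a schema closed: an integration that tries to pass text fails immediately instead of appearing to work.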
Writers are fail-loud (the library raises); wrap calls at integration points where the host process must never block. Sink files are chmod 0600 best-effort.
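One way to wrap a fail-loud writer at an integration point is a decorator that logs and swallows exceptions. The sketch below is generic Python, not something the package ships; `never_raise`, `record_event`, and the `writer` argument are hypothetical names, with `writer` standing in for whatever fail-loud call the host uses.

```python
import functools
import logging

log = logging.getLogger("telemetry_hook")


def never_raise(fn):
    """Decorator: log and swallow any exception so the host process keeps running."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            log.exception("telemetry write failed; continuing")
            return None
    return wrapper


@never_raise
def record_event(writer, **fields):
    # `writer` is a stand-in for a fail-loud call such as the library's append.
    return writer(**fields)
```

Keeping the swallow-and-log policy in the host, rather than in the library, means tests and batch jobs still see real errors while interactive hooks stay non-blocking.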
Event schema
{
"schema": 1,
"ts": "2026-06-10T18:00:00Z",
"session_id": "s1",
"system": "system_a",
"context": "work",
"role": "user",
"kind": "user_message",
"model": "model-x",
"tokens_in": 1200,
"tokens_out": 300,
"cost_usd": 0.04,
"markers": ["corr.strong.2"]
}
kind is free-form; three values carry aggregation semantics: user_message (counts as a user turn; correction detection applies), assistant_message (counts as a turn), anything else (e.g. cost_tick) contributes only tokens/cost. markers is the only detection surface — short, content-free tokens; never message text. Closed-session summary rows add: detector_version, turns, user_turns, corrections, events, first_ts, last_ts, duration_s, ts_closed, and the dominant model (by cost, falling back to frequency).
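The aggregation semantics of `kind` can be sketched as a fold over one session's events. This is an illustrative reading, not the package's implementation: `fold_session` is a hypothetical name, and the sketch assumes that `turns` counts both user and assistant messages and that a user turn is a correction when it carries at least one `corr.*` marker.

```python
from collections import Counter, defaultdict


def fold_session(events):
    """Fold a list of one session's event dicts into turn, correction and cost totals."""
    turns = user_turns = corrections = 0
    cost_by_model = defaultdict(float)
    seen_models = Counter()
    for ev in events:
        kind = ev.get("kind")
        # Assumption: both message kinds count toward `turns`.
        if kind in ("user_message", "assistant_message"):
            turns += 1
        if kind == "user_message":
            user_turns += 1
            if any(m.startswith("corr.") for m in ev.get("markers") or []):
                corrections += 1
        model = ev.get("model")
        if model:
            seen_models[model] += 1
            cost_by_model[model] += ev.get("cost_usd") or 0.0
    # Dominant model: highest total cost, falling back to frequency when no cost is known.
    if any(cost_by_model.values()):
        dominant = max(cost_by_model, key=cost_by_model.get)
    elif seen_models:
        dominant = seen_models.most_common(1)[0][0]
    else:
        dominant = None
    return {
        "turns": turns,
        "user_turns": user_turns,
        "corrections": corrections,
        "events": len(events),
        "tokens_in": sum(ev.get("tokens_in") or 0 for ev in events),
        "tokens_out": sum(ev.get("tokens_out") or 0 for ev in events),
        "cost_usd": sum(ev.get("cost_usd") or 0.0 for ev in events),
        "model": dominant,
    }
```

Events of any other kind (for example `cost_tick`) pass through the loop without touching the turn counters, so they contribute only tokens and cost.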
Correction detector — zos_telemetry.detector
A "correction" is a user turn that reads as the user correcting or redirecting the assistant. Conservative by design; versioned via DETECTOR_VERSION (currently 1).
- `is_correction`: single-turn classification. Strong signals (explicit wrongness/undo language anywhere in the turn) and opener signals (turn begins with "no"/"wait"/"actually"/"nope").
- `correction_markers`: capture-time reduction to content-free tokens (`corr.strong.<i>` / `corr.opener.<i>`, pattern indices). Empty list = no signal. Call this, store the result, discard the text.
- `detect_corrections`: the subset of events that are corrections: `kind == "user_message"` with at least one `corr.*` marker.
- `corrections_per_session`: correction count per `session_id`, including zeros for sessions with no detected corrections.
Known limitations (kept, documented): it under-counts; English-only patterns; opener false positives are possible ("No, Tuesday works" answering a question counts; because the rule is applied symmetrically across systems, comparisons stay fair even when absolute counts drift); questions and approvals containing "right" do not fire; heuristic, not semantic.
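The reduction from text to pattern-index tokens can be sketched with a few regular expressions. The patterns below are hypothetical stand-ins chosen for illustration; the package's real pattern lists and their indices are not reproduced here, and `markers_for` is not the library's function.

```python
import re

# Hypothetical strong signals: explicit wrongness/undo language anywhere in the turn.
STRONG = [
    re.compile(r"\bthat'?s (?:wrong|incorrect)\b", re.I),
    re.compile(r"\bundo (?:that|this)\b", re.I),
    re.compile(r"\bthat is not what i (?:asked|meant)\b", re.I),
]
# Hypothetical opener signals: the turn begins with one of these words.
OPENERS = [
    re.compile(r"^\s*no\b", re.I),
    re.compile(r"^\s*wait\b", re.I),
    re.compile(r"^\s*actually\b", re.I),
    re.compile(r"^\s*nope\b", re.I),
]


def markers_for(text):
    """Reduce a user turn to content-free tokens; the caller then discards the text."""
    found = [f"corr.strong.{i}" for i, p in enumerate(STRONG) if p.search(text)]
    found += [f"corr.opener.{i}" for i, p in enumerate(OPENERS) if p.search(text)]
    return found
```

With these stand-in patterns, `markers_for("No, Tuesday works")` returns `["corr.opener.0"]`, which reproduces the opener false positive described above, while `markers_for("Is that right?")` returns an empty list. Only the token, never the matched text, leaves the function.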
Aggregation — zos_telemetry.aggregate
Pure functions; no I/O; recomputed whole each call, so re-running never double-counts.
- `session_summary`: whole-session metrics: `turns`, `user_turns`, `corrections`, `tokens_in`/`tokens_out`, `cost_usd`, `duration_s` (first→last event timestamp), dominant `model`.
- `per_day`: token/cost rollups keyed by `(day, system, context, model)`; rows sorted; unparseable timestamps land under day `"unknown"`.
- `aggregate_sessions`: per-system totals and means over closed-session rows: `sessions`, `corrections`, `cost_usd`, `turns`, `tokens_in`/`tokens_out`, `duration_s`, plus `corrections_per_session`, `cost_per_session`, `turns_per_session`.
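The per-system totals and means can be illustrated with a pure function over closed-session rows. This is a reduced sketch covering only three of the listed metrics; `per_system` is a hypothetical name and the handling of missing fields (treated as zero) is an assumption.

```python
from collections import defaultdict


def per_system(rows):
    """Total closed-session rows per system and derive per-session means."""
    totals = defaultdict(lambda: {"sessions": 0, "corrections": 0, "cost_usd": 0.0, "turns": 0})
    for row in rows:
        t = totals[row["system"]]
        t["sessions"] += 1
        # Assumption: a missing or null field counts as zero.
        t["corrections"] += row.get("corrections") or 0
        t["cost_usd"] += row.get("cost_usd") or 0.0
        t["turns"] += row.get("turns") or 0
    report = {}
    for system, t in totals.items():
        n = t["sessions"]  # always >= 1 for any system that appears
        report[system] = dict(
            t,
            corrections_per_session=t["corrections"] / n,
            cost_per_session=t["cost_usd"] / n,
            turns_per_session=t["turns"] / n,
        )
    return report
```

Because the function builds its totals from scratch on every call and performs no I/O, calling it twice on the same rows yields the same report, which is the property that prevents double-counting.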
Comparison — zos_telemetry.compare
- `compare`: compares closed-session rows (from `sessions=` if given, else the `SessionLog` at `root`). `window` = trailing days; `None` = all time. Report keys: `system_a` / `system_b` (per-side stats), `window_days`, `sessions_compared`, `headline` (corrections/session, lower is better), `directional`, `note`, `detector_version`, `generated_at`.
- `to_markdown`: renders the report: headline, warning block when directional, metric table.
- `SMALL_SAMPLE_N`: below this many sessions on either side the report sets `directional: true` and carries the "DIRECTIONAL, NOT A VERDICT" note.
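The small-sample rule can be sketched as below. The threshold value of 10 is a placeholder, since the actual value of `SMALL_SAMPLE_N` is not given here; `compare_headline` is a hypothetical name, and treating `sessions_compared` as the total across both sides is an assumption.

```python
SMALL_SAMPLE_N = 10  # placeholder value; the package's real threshold is not stated here


def compare_headline(rows, system_a, system_b):
    """Compute corrections/session per side and flag small samples as directional."""
    def side(name):
        mine = [r for r in rows if r["system"] == name]
        n = len(mine)
        corrections = sum(r.get("corrections") or 0 for r in mine)
        return {
            "sessions": n,
            "corrections_per_session": corrections / n if n else None,
        }

    a, b = side(system_a), side(system_b)
    # Directional when EITHER side has fewer sessions than the threshold.
    directional = min(a["sessions"], b["sessions"]) < SMALL_SAMPLE_N
    return {
        "system_a": a,
        "system_b": b,
        "sessions_compared": a["sessions"] + b["sessions"],  # assumption: total of both sides
        "directional": directional,
        "note": "DIRECTIONAL, NOT A VERDICT" if directional else "",
    }
```

Checking the smaller side, rather than the combined count, keeps a lopsided comparison (say 3 sessions against 200) from being reported as conclusive.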
CLI — zos-telemetry
All subcommands accept --root DIR.
zos-telemetry append --session-id S --system NAME --kind KIND
[--role R] [--context C] [--model M]
[--tokens-in N] [--tokens-out N] [--cost-usd X] [--ts ISO]
[--marker TOKEN]... [--mark-stdin]
zos-telemetry close --session-id S # exit 1 if no events
zos-telemetry sessions # JSONL on stdout
zos-telemetry compare SYSTEM_A SYSTEM_B [--window DAYS] [--json]
zos-telemetry --version
--mark-stdin reads the user message on stdin, stores only its correction markers, and discards the text — the supported way to feed the detector from a shell hook without persisting content.
# capture (e.g. from each system's stop hook):
echo "$USER_MESSAGE" | zos-telemetry append \
--session-id "$SESSION_ID" --system system_a \
--kind user_message --mark-stdin # markers stored, text discarded
# close the session into one summary row, then compare:
zos-telemetry close --session-id "$SESSION_ID"
zos-telemetry compare system_a system_b --window 30
Environment variables
| Variable | Effect |
|---|---|
| `ZOS_HOME` | Base directory; storage goes to `$ZOS_HOME/telemetry` when no explicit root is given. Default base: `~/.zos`. |