zos-loops — loop engine.
zos-loops is the runtime that lets an AI system (or any system) run unattended, safely repeating work on an interval. Unattended loops fail in predictable ways: two copies run at once, a crashed loop leaves a lock behind, a "healthy" heartbeat hides a writer recording stale state, one bad step kills the whole cycle. zos-loops packages the counter-patterns as a small library and CLI.
Python 3.11+ · stdlib only · BUSL-1.1
Status: working code · 52 tests · CI green · BUSL-1.1 · repository private until public launch.
All state resolves under <root>/loops/<name>/, where <root> is, in precedence order: an explicit root argument → $ZOS_HOME → ~/.zos. Per-loop files: loop.lock, heartbeat.json, ticks.jsonl.
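The precedence above can be sketched in a few lines. This is a minimal illustration, not the library's code: `resolve_root` and `loop_files` are hypothetical names (the real helpers live in `zos_loops.home`), and treating an empty `$ZOS_HOME` as unset is an assumption.

```python
import os
from pathlib import Path


def resolve_root(root=None, env=None):
    """Return the state root: explicit argument, then $ZOS_HOME, then ~/.zos."""
    env = os.environ if env is None else env
    if root is not None:
        return Path(root).expanduser()
    env_root = env.get("ZOS_HOME")
    if env_root:  # assumption: an empty value counts as unset
        return Path(env_root).expanduser()
    return Path.home() / ".zos"


def loop_files(name, root=None, env=None):
    """Map the three per-loop file names to paths under <root>/loops/<name>/."""
    base = resolve_root(root, env) / "loops" / name
    return {f: base / f for f in ("loop.lock", "heartbeat.json", "ticks.jsonl")}
```

Passing `env` explicitly keeps the sketch testable without touching the real environment.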
Key concepts
- Singleton per scope: at most one loop per name, enforced by a PID-liveness lockfile with stale-lock takeover (a dead holder is verified dead before its lock is reclaimed; a live holder is never stolen from).
- Dual-axis heartbeat: every tick rewrites a heartbeat checked on BOTH file mtime (a writer is running) and the inner `ts` field (the writer is recording current state). A writer that keeps touching the file while stamping stale payloads classifies as `diverged`, not healthy.
- Per-step isolation: a tick is an ordered list of named steps (lowest priority number first). One failing step never kills the tick.
- No self-halt: a loop never decides to stop itself. Failures back off exponentially; they never exit. The only stops are external: a signal/kill, an external `stop_event`, or an operator-supplied tick bound set at arm time.
- Kill-switch: `ZOS_LOOPS_DISABLED=1` refuses to arm new loops and no-ops the ticks of running ones (still heartbeating, still logged) until it is cleared.
Runner — zos_loops.runner
Provide exactly ONE of:
| Arg | Tick becomes |
|---|---|
| `tick_fn` | a single step `"tick"` running the zero-arg callable |
| `tick_cmd` | a single step `"tick"` running the shell command |
| `steps` | the given `Step` sequence, sorted by priority (stable) |
- `run(max_ticks=None)`: arm and run. Refuses to arm when the kill-switch is on (`"refused-disabled"`) or another live loop holds the lock (`"refused-held"`). Otherwise the first tick fires inline at arm time (no dead first interval), then the loop ticks every `interval` seconds (plus backoff) until an external stop. Returns a terminal status string; releases the lock on any exit. The loop never returns because a tick failed. `max_ticks` is an arm-time operator bound (used for `--once` scheduler-driven ticks and tests); reaching it returns `"stopped-bound"`. Setting `runner.stop_event` (a `threading.Event`) from outside returns `"stopped-external"`; the runner never sets it itself.
- `run_tick() -> TickResult`: one tick. It writes the heartbeat, runs every step in priority order with per-step error isolation, and appends one JSONL record. Never raises.
- Backoff: a whole-tick failure is a tick where zero steps succeeded. After `failure_threshold` consecutive whole-tick failures, an extra sleep of `interval * backoff_base ** n` (capped at `backoff_cap_s`) is added, growing with each further failure; any non-failed tick resets it.
- Kill-switch: `ZOS_LOOPS_DISABLED=1` (also exported as `zos_loops.KILL_SWITCH_ENV`). On at arm → refuse. Flipped on mid-run → ticks become status `"disabled"` no-ops (no steps run, record still logged) until it is cleared; the loop does not exit.
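The backoff rule and the kill-switch check can be sketched as plain functions. This is an illustration under stated assumptions, not the runner's code: the default values for `backoff_base` and `backoff_cap_s`, the exact meaning of `n` (here: failures counted from the threshold onward), and the rule that only the literal value `"1"` enables the switch are all guesses.

```python
import os

KILL_SWITCH_ENV = "ZOS_LOOPS_DISABLED"  # variable name taken from the text


def kill_switch_on(env=None):
    """True when the kill-switch is set. Assumption: only "1" counts."""
    env = os.environ if env is None else env
    return env.get(KILL_SWITCH_ENV) == "1"


def backoff_extra_s(consecutive_failures, interval, failure_threshold=3,
                    backoff_base=2.0, backoff_cap_s=3600.0):
    """Extra sleep added on top of `interval` after repeated whole-tick failures.

    Assumption: n = 0 on the tick that first reaches the threshold and grows
    by one per further failure. The real exponent may be defined differently.
    """
    if consecutive_failures < failure_threshold:
        return 0.0
    n = consecutive_failures - failure_threshold
    return min(interval * backoff_base ** n, backoff_cap_s)
```

With `interval=300` and the assumed defaults, the extra sleep is 0 s for the first two failures, then 300 s, 600 s, 1200 s, and so on up to the cap. A non-failed tick resets `consecutive_failures` to zero, which drops the extra sleep back to 0.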
`Step`: one named unit of tick work, built from exactly one of `fn` / `cmd`. Lower `priority` runs first; ties keep declaration order. `cmd` runs via `subprocess` (`shell=True`, output captured, non-zero exit = step failure).
```python
from zos_loops import LoopRunner, Step

# process_blocked, promote_scheduled and run_ready_work are caller-supplied callables.
runner = LoopRunner(
    "groomer",
    steps=[
        Step(name="unblock", fn=process_blocked, priority=0),  # first
        Step(name="schedule", fn=promote_scheduled, priority=1),
        Step(name="execute", fn=run_ready_work, priority=2),
    ],
    interval=300,         # seconds between ticks
    failure_threshold=3,  # whole-tick failures before backoff
)
runner.run()  # acquires the lock, ticks until externally stopped
```
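Per-step isolation is easiest to see in a stand-alone sketch. The code below is not the runner's implementation: `SketchStep` and `run_steps` are hypothetical names, and the `"ok"` / `"failed"` status strings are assumptions. It only demonstrates the behaviour described above: stable priority ordering, one failing step not stopping the rest, and a whole-tick failure meaning zero steps succeeded.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SketchStep:
    """Hypothetical stand-in for Step: exactly one of fn / cmd."""
    name: str
    fn: Optional[Callable[[], object]] = None
    cmd: Optional[str] = None
    priority: int = 0

    def __post_init__(self):
        if (self.fn is None) == (self.cmd is None):
            raise ValueError("provide exactly one of fn / cmd")


def run_steps(steps):
    """Run every step in priority order; a failing step never stops the rest."""
    results = []
    # sorted() is stable, so equal priorities keep declaration order.
    for step in sorted(steps, key=lambda s: s.priority):
        start = time.monotonic()
        record = {"name": step.name, "status": "ok"}
        try:
            if step.fn is not None:
                step.fn()
            else:
                # check=True turns a non-zero exit into CalledProcessError.
                subprocess.run(step.cmd, shell=True, capture_output=True, check=True)
        except Exception as exc:  # isolate: record the error type, keep going
            record["status"] = "failed"
            record["error_type"] = type(exc).__name__
        record["ms"] = round((time.monotonic() - start) * 1000)
        results.append(record)
    # Whole-tick failure per the text: zero steps succeeded.
    any_ok = any(r["status"] == "ok" for r in results)
    return ("ok" if any_ok else "failed"), results
```

Only the exception's type name is kept, which matches the "shape only" rule of the tick log: no messages and no captured output leave the step.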
Health — zos_loops.health
| Status | Meaning |
|---|---|
| `"stopped"` | no lock file: nothing claims to run |
| `"running"` | lock held by a live PID and heartbeat fresh on both axes |
| `"stale"` | everything else: dead-PID lock (crash without cleanup), or a live holder whose heartbeat is missing / unreadable / mtime-stale / diverged |
LoopHealth: name, status, detail (human reason), lock_holder (dict or None), heartbeat (HeartbeatCheck or None).
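The three-way classification reduces to a short decision function. The sketch below is an assumption-laden illustration: `classify_health` is a hypothetical name, it returns a plain `(status, detail)` tuple instead of a `LoopHealth`, and the detail strings are invented. The exit-code mapping is the one listed for `zos-loop health`.

```python
# Exit codes as documented for `zos-loop health`.
EXIT_CODES = {"running": 0, "stopped": 1, "stale": 2}


def classify_health(lock_exists, holder_alive, heartbeat_status):
    """Collapse lock + heartbeat state into "stopped" / "running" / "stale".

    heartbeat_status is one of "fresh", "stale", "diverged", "missing",
    "unreadable".
    """
    if not lock_exists:
        return "stopped", "no lock file"
    if not holder_alive:
        return "stale", "lock held by a dead PID (crash without cleanup)"
    if heartbeat_status == "fresh":
        return "running", "live holder, heartbeat fresh on both axes"
    return "stale", f"live holder but heartbeat is {heartbeat_status}"
```

Note the asymmetry: `"running"` needs every signal to agree, while any single bad signal on a held lock is enough for `"stale"`.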
Heartbeat — zos_loops.heartbeat
write_heartbeat is an atomic (tmp + rename) write of {"ts": iso8601, "epoch": float, "pid": int, "interval_s": float, "tick": int} — mtime and payload move together on a healthy writer. check_heartbeat is the dual check; max_age_s defaults to the interval_s recorded in the heartbeat × 2.5.
| Status | File mtime | Inner ts | Meaning |
|---|---|---|---|
| `"fresh"` | fresh | fresh | healthy |
| `"stale"` | stale | - | writer not running |
| `"diverged"` | fresh | stale | writer runs but records stale state |
| `"missing"` / `"unreadable"` | - | - | no usable heartbeat |
HeartbeatCheck: status, file_age_s, inner_age_s, max_age_s, payload, property fresh (True only for "fresh").
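A minimal sketch of the write and the dual check, assuming the payload layout given above. The `sketch_` functions are hypothetical stand-ins, the temp-file naming is an assumption, and the check returns only the status string instead of a full `HeartbeatCheck`.

```python
import json
import os
import time
from datetime import datetime, timezone


def sketch_write_heartbeat(path, interval_s, tick, now=None):
    """Atomically replace the heartbeat file (tmp + rename)."""
    now = time.time() if now is None else now
    payload = {
        "ts": datetime.fromtimestamp(now, tz=timezone.utc).isoformat(),
        "epoch": now,
        "pid": os.getpid(),
        "interval_s": float(interval_s),
        "tick": tick,
    }
    tmp = f"{path}.tmp"  # assumption: sibling temp file on the same filesystem
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    os.replace(tmp, path)  # readers see the old file or the new one, never half
    return payload


def sketch_check_heartbeat(path, max_age_s=None, now=None):
    """Classify a heartbeat on both axes: file mtime and the inner ts."""
    now = time.time() if now is None else now
    try:
        file_age = now - os.stat(path).st_mtime
    except FileNotFoundError:
        return "missing"
    try:
        with open(path, encoding="utf-8") as fh:
            payload = json.load(fh)
        inner_age = now - datetime.fromisoformat(payload["ts"]).timestamp()
        if max_age_s is None:
            max_age_s = float(payload["interval_s"]) * 2.5
    except (OSError, ValueError, KeyError, TypeError):
        return "unreadable"
    if file_age > max_age_s:
        return "stale"      # nobody is touching the file
    if inner_age > max_age_s:
        return "diverged"   # file is touched, but the payload is old
    return "fresh"
```

The `now` parameter exists so the sketch can simulate a writer that stamps an old timestamp into a freshly written file, which is exactly the `"diverged"` case.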
Lock — zos_loops.lock
- `acquire` returns `(True, "acquired", None)` · `(True, "stale-reclaim", prior_holder)` · `(False, "held", holder)` · `(False, "error: ...", None)`.
- PID-liveness check via signal 0; ambiguous liveness is treated as alive (a lock is never stolen on doubt). Creation is `O_CREAT|O_EXCL` (race-safe). A corrupt lock file parses as pid 0 → stale → reclaimable.
- `release` removes the lock only if `owner_pid` (default: this process) matches the recorded holder.
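The lock rules can be sketched with the standard library alone. Several things here are assumptions: the JSON `{"pid": ...}` file format, the `sketch_` function names, and the reclaim strategy (unlink, then retry the exclusive create once). The signal-0 probe is POSIX-only; do not run it on Windows, where `os.kill` has different semantics.

```python
import json
import os


def pid_alive(pid):
    """Signal-0 liveness probe (POSIX). Doubt counts as alive."""
    if pid <= 0:
        return False  # a corrupt lock parses as pid 0, which is never alive
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering anything
    except ProcessLookupError:
        return False
    except OSError:
        return True  # e.g. PermissionError: it exists, so never steal
    return True


def read_holder(path):
    """Parse the lock file; anything unreadable becomes pid 0."""
    try:
        with open(path, encoding="utf-8") as fh:
            return {"pid": int(json.load(fh)["pid"])}
    except (OSError, ValueError, KeyError, TypeError):
        return {"pid": 0}


def sketch_acquire(path):
    """Return (ok, reason, holder) in the shape listed above."""
    prior = None
    for _ in range(2):  # the second pass runs only after removing a stale lock
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            holder = read_holder(path)
            if pid_alive(holder["pid"]):
                return False, "held", holder
            prior = holder
            try:
                os.unlink(path)
            except FileNotFoundError:
                pass  # another reclaimer got there first
            continue
        except OSError as exc:
            return False, f"error: {exc}", None
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"pid": os.getpid()}, fh)
        if prior is not None:
            return True, "stale-reclaim", prior
        return True, "acquired", None
    return False, "error: lost reclaim race", None


def sketch_release(path, owner_pid=None):
    """Remove the lock only when the recorded holder matches owner_pid."""
    owner_pid = os.getpid() if owner_pid is None else owner_pid
    if read_holder(path)["pid"] != owner_pid:
        return False
    os.unlink(path)
    return True
```

This sketch leaves two windows open that a production lock has to close: a file that has been created but not yet written reads as pid 0, and two processes reclaiming the same stale lock can interleave between the unlink and the retry.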
Tick log — zos_loops.ticklog
ticks.jsonl: one JSON object per tick — {ts, loop, tick, status, duration_ms, steps: [{name, status, ms[, error_type]}], consecutive_failures, backoff_s}. Shape only: no step output, no error messages, no payloads.
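Appending to and reading back a JSONL log needs very little code. The helpers below are hypothetical (`append_tick`, `read_ticks`), and skipping unparseable lines on read is a design assumption rather than documented behaviour; the field names follow the record shape above.

```python
import json


def append_tick(path, record):
    """Append one tick as a single JSON line."""
    line = json.dumps(record, separators=(",", ":"))
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def read_ticks(path):
    """Yield parsed records, skipping blank or corrupt lines."""
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            raw = raw.strip()
            if not raw:
                continue
            try:
                yield json.loads(raw)
            except ValueError:
                continue  # e.g. a line torn by a crash mid-write


# Illustrative record only; the values are made up.
EXAMPLE_RECORD = {
    "ts": "2025-01-01T00:00:00+00:00",
    "loop": "groomer",
    "tick": 1,
    "status": "ok",
    "duration_ms": 12,
    "steps": [{"name": "unblock", "status": "failed", "ms": 3,
               "error_type": "RuntimeError"}],
    "consecutive_failures": 0,
    "backoff_s": 0.0,
}
```

One object per line means a reader can tail the file or count recent failures without parsing the whole log as a single document.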
Scheduler adapters — zos_loops.schedulers
Template generators — they return text, they never call the OS. Use them so an OS scheduler drives the cadence while the runner provides the safety.
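To make "return text, never call the OS" concrete, here is a hedged sketch of a cron template generator. `cron_line` is a hypothetical function, not the adapter's API, and the choice to emit `zos-loop run ... --once` is an assumption based on the CLI synopsis. It only handles intervals that are a whole number of minutes dividing an hour evenly, because `*/N` in the minute field is irregular otherwise.

```python
import shlex


def cron_line(name, cmd, interval_s):
    """Return a crontab line as text; nothing is installed or executed."""
    minutes, rem = divmod(int(interval_s), 60)
    if rem or not 1 <= minutes <= 59 or 60 % minutes:
        raise ValueError("interval must be whole minutes that divide an hour")
    run = f"zos-loop run {shlex.quote(name)} --cmd {shlex.quote(cmd)} --once"
    return f"*/{minutes} * * * * {run}"
```

For example, `cron_line("groomer", "python -m groom", 300)` returns `*/5 * * * * zos-loop run groomer --cmd 'python -m groom' --once`. The caller decides whether and how to install the line.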
CLI — zos-loop
```
zos-loop status [--root PATH] [--max-age S] [--json]
zos-loop health <name> [--root PATH] [--max-age S] [--json]
    exit code: 0 running · 1 stopped · 2 stale
zos-loop run <name> --cmd CMD [--interval S] [--once | --max-ticks N]
    [--failure-threshold K] [--root PATH]
zos-loop schedule <name> --cmd CMD --interval S
    --format launchd|systemd|cron [--label L]
```
Paths helpers (zos_loops.home): zos_home(root=None) · loops_root(root=None) · loop_dir(name, root=None) · safe_name(name).