Eval reports - BackAnt

`backant eval` is Kairos’s self-report layer. It computes weekly metrics over the recent observable history and runs a small simulated-scenario replay to catch silent regressions that live operation would not surface. Run it every Friday alongside your retros.

`backant eval run`

Generate a Markdown report.

backant eval run                                 # default 7-day window
backant eval run --window-days 14                # two weeks of history
backant eval run --dataset-dir ./eval/dataset    # custom replay dataset

Option	Default	Notes
`--window-days`	`7`	Look-back window for metrics.
`--dataset-dir`	`<workspace>/eval/dataset`	Scenario-replay dataset directory.

Reports land under ~/.claude/kairos/eval/. Each is timestamped and includes:

Compliance: did Kairos respect .backant.toml policy across the window? (e.g. did it touch any excluded paths?)
Cost: per-turn USD, per-PR USD, daily totals, comparisons against prior weeks.
Outcome: PRs opened, PRs merged, CI green / red breakdown, mean time to merge.
Scenario replay: small simulated tasks executed against the current memory state, to detect regressions in recall quality or judgment.

`backant eval report`

Print the most recent report:

backant eval report

Useful for scripting (paste into Slack, attach to a retro doc).
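
As one possible scripting pattern, a retro helper might save the latest report to a dated file. This is a minimal sketch: it assumes `backant eval report` writes the report to standard output, and the function name `save_latest_report` and the `kairos-eval-<date>.md` file name are illustrative, not part of BackAnt.

```shell
# Save the most recent report to a dated Markdown file and print its path.
# Assumption: `backant eval report` writes the report to standard output.
save_latest_report() {
  out_dir="${1:-.}"                                # target directory (default: current)
  out_file="$out_dir/kairos-eval-$(date +%F).md"   # hypothetical naming scheme
  backant eval report > "$out_file" && printf '%s\n' "$out_file"
}
```

Calling `save_latest_report ./retros` would then leave a file you can attach to the retro doc.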

What scenario replay catches

Live outcome metrics tell you whether work is succeeding on average. Scenario replay catches a different failure mode: silent regressions in capability.

Example: a memory rewrite changes how entries are scored. Kairos continues to ship PRs, so cost and outcome metrics look fine, but the scenario replay shows that recall quality on a known set of test cues dropped 30%. That is the signal the eval is designed to surface. The replay is intentionally adversarial (small, fixed, and executed on every eval), so changes that look fine in production but break the underlying memory layer get caught.
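
To make the size of such a drop concrete, the relative change between two weekly recall values can be worked out by hand. The sketch below assumes the drop is measured relative to the previous week's value; `recall_drop_pct` is a hypothetical helper, not a BackAnt command, and the input numbers are illustrative.

```shell
# Relative drop in recall between two reports, as a percentage.
# Usage: recall_drop_pct PREVIOUS CURRENT
# A positive result is a regression; a negative result is an improvement.
recall_drop_pct() {
  awk -v prev="$1" -v cur="$2" 'BEGIN {
    if (prev <= 0) { print "n/a"; exit }       # avoid dividing by zero
    printf "%.1f\n", (prev - cur) / prev * 100
  }'
}

recall_drop_pct 0.80 0.56   # prints 30.0, the size of drop described above
```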

Reading a report

A healthy week looks like:

Compliance: 100% (0 policy violations)
Cost:       $73.20 across 142 turns  ($0.52 / turn)
Outcome:    19 PRs opened, 14 merged (7-day rolling mean: 12)
Replay:     12/12 scenarios pass; recall@5 = 0.81 (last week: 0.79)

Signs of an unhealthy week that warrant investigation:

Policy violations > 0 (Kairos is doing something it shouldn’t)
Cost per turn climbing without outcome rising (work getting expensive without producing more PRs)
Replay regressions (recall quality dropping, even if production looks fine)
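
Two of these checks can be automated against the summary lines. The sketch below assumes the summary is formatted exactly like the healthy-week sample above (that layout may differ in real reports); `check_report` is a hypothetical helper. It does not cover the cost trend, which needs two reports to compare.

```shell
# Flag policy violations and replay failures in a report read from stdin.
# Assumption: summary lines look like the healthy-week sample shown above.
# Prints "ok" and returns 0 when clean; prints findings and returns 1 otherwise.
check_report() {
  awk '
    /^Compliance:/ {
      # "(0 policy violations)" -> 0
      if (match($0, /\([0-9]+ policy violation/)) {
        v = substr($0, RSTART + 1, RLENGTH - 1) + 0
        if (v > 0) { print "policy violations: " v; bad = 1 }
      }
    }
    /^Replay:/ {
      # "12/12 scenarios pass" -> passed = 12, total = 12
      split($2, parts, "/")
      if (parts[1] + 0 < parts[2] + 0) {
        print "replay failures: " (parts[2] - parts[1]); bad = 1
      }
    }
    END { if (!bad) print "ok"; exit bad + 0 }
  '
}
```

A typical use would be `backant eval report | check_report`, again assuming the report is written to standard output.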

Where reports live

~/.claude/kairos/eval/
├── 2026-W19.md          # weekly reports, timestamped
├── 2026-W20.md
└── scenarios/
    └── replay-runs/     # per-scenario replay outputs

The Markdown files are designed to be readable on their own: share them with teammates, paste them into Slack, or pipe them through `glow` for a colorized terminal view.
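
If you want to open the newest file directly, the naming scheme in the tree above helps: names such as `2026-W19.md` sort chronologically as plain strings. A minimal sketch, assuming every weekly report follows that `YYYY-Www.md` pattern (`latest_report` is a hypothetical helper):

```shell
# Print the path of the newest weekly report in a directory.
# Assumption: weekly reports are named YYYY-Www.md, as in the tree above.
latest_report() {
  dir="${1:-$HOME/.claude/kairos/eval}"
  for f in "$dir"/[0-9][0-9][0-9][0-9]-W[0-9][0-9].md; do
    [ -e "$f" ] && printf '%s\n' "$f"    # skip the literal pattern if nothing matches
  done | sort | tail -n 1
}
```

For example, `glow "$(latest_report)"` would render the most recent report in the terminal.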

Tuning the eval window

The default 7-day window is right for most users. Two cases to adjust:

High-velocity teams (>20 PRs/day): reduce to 3 days so trends don’t get washed out
Slow-moving codebases (a few PRs/week): extend to 14 or 28 days so each report has enough signal

backant eval run --window-days 3
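
The rule of thumb above can be written down as a small helper. This is only a sketch: the input is your own estimate of average PRs per day, the cut-off of fewer than one PR per day for "slow-moving" is an assumption (the guidance says only "a few PRs/week"), and `suggest_window_days` is a hypothetical name.

```shell
# Suggest a --window-days value from the average number of PRs per day.
suggest_window_days() {
  awk -v per_day="$1" 'BEGIN {
    if (per_day > 20)     print 3     # high velocity: shorter window
    else if (per_day < 1) print 14    # slow-moving (assumed cut-off): longer window
    else                  print 7     # default
  }'
}

suggest_window_days 25    # prints 3
```

The result can be passed straight to the flag, for example `backant eval run --window-days "$(suggest_window_days 25)"`.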

What’s not in the report

For privacy, the eval report does not include:

Raw memory entries
Specific PR diffs
Anything that could leak codebase content

The report contains only aggregate metrics and scenario-replay pass/fail counts. If you want more detail, run `backant memory stats` or `backant memory render` locally.
