`backant eval` is Kairos’s self-report layer. It computes weekly metrics from the recent observable history and runs a small simulated-scenario replay to catch quiet regressions that live operation would not reveal.
Use it every Friday alongside your retros.
backant eval run
Generate a Markdown report.
| Option | Default | Notes |
|---|---|---|
| `--window-days` | 7 | Look-back window for metrics. |
| `--dataset-dir` | `<workspace>/eval/dataset` | Scenario-replay dataset directory. |
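
If the weekly run is scripted rather than typed by hand, the two options in the table can be assembled programmatically. The following is a minimal sketch, not part of backant itself: `build_eval_run_command` is a hypothetical helper, and it only uses the two flags listed in the table.

```python
import shlex


def build_eval_run_command(window_days=7, dataset_dir=None):
    """Return the argv list for `backant eval run` (hypothetical helper)."""
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    argv = ["backant", "eval", "run", "--window-days", str(window_days)]
    # Pass --dataset-dir only when overriding the default location.
    if dataset_dir is not None:
        argv += ["--dataset-dir", str(dataset_dir)]
    return argv


# Show the command a 14-day run with an explicit dataset directory would use.
print(shlex.join(build_eval_run_command(14, "eval/dataset")))
```

The resulting list can be handed to `subprocess.run` unchanged; building a list instead of a string avoids shell-quoting problems with paths that contain spaces.
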
Reports are written to ~/.claude/kairos/eval/. Each is timestamped and includes:
- Compliance: did Kairos respect `.backant.toml` policy across the window? (e.g. did it touch any excluded paths?)
- Cost: per-turn USD, per-PR USD, daily totals, comparisons against prior weeks.
- Outcome: PRs opened, PRs merged, CI green / red breakdown, mean time to merge.
- Scenario replay: small simulated tasks executed against the current memory state, to detect regressions in recall quality or judgment.
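
To make the scenario-replay item more concrete, here is a rough sketch of how a fixed set of cues could be scored against a memory layer. It is an illustration under stated assumptions, not Kairos's actual implementation: the recall@k metric, the function names, and the toy data are all hypothetical.

```python
def recall_at_k(retrieved, expected, k=5):
    """Fraction of expected entry ids that appear in the top-k retrieved ids."""
    if not expected:
        return 1.0
    top_k = set(retrieved[:k])
    return sum(1 for entry in expected if entry in top_k) / len(expected)


def replay_score(scenarios, retrieve, k=5):
    """Mean recall over a fixed list of (cue, expected_ids) scenarios."""
    if not scenarios:
        raise ValueError("scenario set must not be empty")
    scores = [recall_at_k(retrieve(cue), expected, k) for cue, expected in scenarios]
    return sum(scores) / len(scores)


def relative_drop(baseline, current):
    """Relative drop versus the baseline (0.30 means a 30% drop)."""
    if baseline == 0:
        return 0.0
    return max(0.0, (baseline - current) / baseline)


# Toy data (assumed): a fixed scenario set and two retrieval behaviours,
# one before and one after a hypothetical change to memory scoring.
SCENARIOS = [("flaky auth test", ["m1", "m2"]), ("deploy rollback", ["m3"])]
before = {"flaky auth test": ["m1", "m2", "m9"], "deploy rollback": ["m3"]}
after = {"flaky auth test": ["m9", "m1"], "deploy rollback": ["m8"]}

baseline = replay_score(SCENARIOS, lambda cue: before[cue])
current = replay_score(SCENARIOS, lambda cue: after[cue])
print(f"recall {baseline:.2f} -> {current:.2f}, drop {relative_drop(baseline, current):.0%}")
```

Because the scenario set is fixed, any change in the score between two evals is attributable to the memory state rather than to a different mix of tasks.
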
backant eval report
Print the most recent report.
What scenario replay catches
Live outcome metrics tell you whether work is succeeding on average. Scenario replay catches a different failure mode: silent regressions in capability.
Example: a memory rewrite changes how entries are scored. Kairos continues to ship PRs, so cost and outcome metrics look fine. But the scenario replay shows that on a known set of test cues, recall quality dropped 30%. That’s the signal the eval is designed to surface.
The replay is intentionally adversarial (small, fixed, executed on every eval), so changes that look fine in production but break the underlying memory layer get caught.
Reading a report
A healthy week shows none of these warning signs:
- Policy violations > 0 (Kairos is doing something it shouldn’t)
- Cost per turn climbing without outcome rising (work getting expensive without producing more PRs)
- Replay regressions (recall quality dropping, even if production looks fine)
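
The three warning signs above can be checked mechanically once the numbers are copied out of two consecutive reports. This is a sketch only: the `WeekSummary` fields are hypothetical names chosen for illustration and do not reflect the report's actual layout.

```python
from dataclasses import dataclass


@dataclass
class WeekSummary:
    """Numbers copied by hand from one report (field names are assumptions)."""
    policy_violations: int
    cost_per_turn_usd: float
    prs_merged: int
    replay_recall: float


def red_flags(current, previous):
    """Return the warning signs that apply to `current` relative to `previous`."""
    flags = []
    if current.policy_violations > 0:
        flags.append("policy violations")
    # Cost per turn is up while the outcome (merged PRs) is flat or down.
    if (current.cost_per_turn_usd > previous.cost_per_turn_usd
            and current.prs_merged <= previous.prs_merged):
        flags.append("cost rising without outcome")
    if current.replay_recall < previous.replay_recall:
        flags.append("replay regression")
    return flags


last_week = WeekSummary(0, 0.040, 12, 0.90)
this_week = WeekSummary(0, 0.055, 11, 0.63)
print(red_flags(this_week, last_week))
```

An empty list corresponds to a healthy week; anything else is worth raising in the retro.
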
Where reports live
Reports are Markdown files under ~/.claude/kairos/eval/. Open one with glow for a colorized terminal view.
Tuning the eval window
The default 7-day window is right for most users. Two cases to adjust:
- High-velocity teams (>20 PRs/day): reduce to 3 days so trends don’t get washed out
- Slow-moving codebases (a few PRs/week): extend to 14 or 28 days so each report has enough signal
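
The guidance above can be written as a small rule of thumb. This is a sketch with one loudly labeled assumption: "a few PRs/week" is interpreted here as fewer than one PR per day, a threshold the guidance itself does not state. The function name is hypothetical.

```python
def suggest_window_days(prs_per_day):
    """Suggest a value for --window-days from the average PR rate."""
    if prs_per_day < 0:
        raise ValueError("prs_per_day must not be negative")
    if prs_per_day > 20:
        # High-velocity team: a shorter window keeps trends visible.
        return 3
    if prs_per_day < 1:
        # Assumption: "a few PRs/week" means under one PR per day.
        # 14 is the smaller of the two suggested values; 28 is the other.
        return 14
    return 7


for rate in (30, 5, 0.4):
    print(rate, "PRs/day ->", suggest_window_days(rate), "days")
```
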
What’s not in the report
For privacy, the eval report does not include:
- Raw memory entries
- Specific PR diffs
- Anything that could leak codebase content
To inspect memory directly, run `backant memory stats` or `backant memory render` locally.