The benchmark, with handles.
We built one primitive — an importance scorer with an exponential recency half-life — and measured it on a needle-in-haystack task across windows, needle counts, and seeds. The verdict on Claim 10 is a function of these knobs. Move them; watch it move with you.
Figure 1. Retention as a function of recency half-life, holding window=32k and needles=8. Orange is the highlighted seed; gray is the per-seed cloud; the dark line is the mean. The variance band is wider than the mean's curvature — which is the whole point of the verdict.
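To make the half-life knob concrete, here is a minimal sketch of an exponential recency weight. The names `recency_weight` and `score_events`, and the choice to measure age in event positions, are assumptions for illustration; this is not the observatory scorer itself.

```python
def recency_weight(age: float, half_life: float) -> float:
    """Weight that halves every `half_life` units of age (hypothetical helper)."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    if age < 0:
        raise ValueError("age must be non-negative")
    return 0.5 ** (age / half_life)


def score_events(num_events: int, half_life: float) -> list[float]:
    """Score events 0..num_events-1, treating the last event as age 0."""
    newest = num_events - 1
    return [recency_weight(newest - i, half_life) for i in range(num_events)]


# The newest event scores 1.0; an event one half-life older scores 0.5.
scores = score_events(num_events=5, half_life=2.0)
```

A short half-life makes the scorer behave like truncation; a long one flattens the weights so that other signals have to break ties.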
Why an immutable event log.
The proposed two-tier mutable storage architecture is structurally wrong. Tier promotion is a point of ruin: if your importance scorer makes a wrong promotion at step 47 of a 200-step run, that error is baked into "consolidated" state and propagates forward.
The append-only log doesn't have this property. Wrong signals are isolated events, not architectural state. Pure scoring functions over the log mean different scorers can be A/B'd against the same historical events — deterministically replayable, debuggable, and amenable to retrospective analysis after new evals or new models ship.
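The A/B property can be sketched with a toy stand-in. Everything below (`toy_view`, the scorer signature, the two example scorers) is hypothetical and only illustrates the idea of pure scoring functions over an unchanging log; the real `view()` may work differently.

```python
from typing import Callable, Sequence

Event = dict  # an event is a plain dict here; the real schema may differ
Scorer = Callable[[int, Event, int], float]  # (index, event, log_length) -> score


def toy_view(log: Sequence[Event], scorer: Scorer, window: int) -> list[Event]:
    """Pure function: pick the `window` highest-scoring events, keep log order."""
    n = len(log)
    # Ties are broken by index so the result is fully deterministic.
    ranked = sorted(range(n), key=lambda i: (-scorer(i, log[i], n), i))
    kept = sorted(ranked[:window])
    return [log[i] for i in kept]


def recency(i: int, event: Event, n: int) -> float:
    return float(i)  # newer events score higher


def prefers_tools(i: int, event: Event, n: int) -> float:
    return (10.0 if event["role"] == "tool" else 0.0) + i / n


# A tuple stands in for the append-only log: scorers cannot mutate it.
log = (
    {"role": "user", "content": "a"},
    {"role": "tool", "content": "b"},
    {"role": "assistant", "content": "c"},
    {"role": "user", "content": "d"},
)

view_a = toy_view(log, recency, window=2)        # keeps "c" and "d"
view_b = toy_view(log, prefers_tools, window=2)  # keeps "b" and "d"
```

Because neither scorer writes anything back, a bad scorer produces a bad view and nothing else; swapping it out and replaying the same log is enough to recover.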
The "git for agent context" framing is the wrong metaphor. Git models diff intention; an agent log is a fact stream. The mental model is closer to Datomic than to git.
from observatory import EventLog, view, importance, confidence
log = EventLog()
log.append({"role": "user", "content": "..."})
log.append({"role": "assistant", "content": "..."})
log.append({"role": "tool", "name": "search", "content": "..."})
# Working memory is a derived view, not stored state
working = view(
    log,
    scorer=importance.recency_attention(),
    window=4096,
)
sig = confidence.dissociation(retrieval_score=0.92, generation_score=0.41)
if sig.diverged():
    log.append({"role": "verifier", "action": "re-query", "trigger": sig})
alt = view(log, scorer=importance.task_relevance(query="..."), window=4096)
delta = compare(working, alt)
What the numbers say.
The figure below is real. It is the output of python -m eval.benchmarks.needle_retention at the default config — 8 needles randomly placed among 56 noise events, working window 16, 10 seeds. The script is committed at eval/benchmarks/needle_retention.py; the JSON is consumed at build time.
Needle-Retention Benchmark · 10 seeds · ach-benchmark-1
Working-set retention by strategy
Real run of python -m eval.benchmarks.needle_retention. 8 needles randomly placed among 56 noise events; budget = 16 events. Higher is better. Bars show mean retention ± 1 std. Each dot is one seed run.
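For readers who want the retention metric spelled out, the sketch below computes it for a naive-truncation baseline on a toy version of the setup (8 needles, 56 noise events, budget 16). The helpers `make_events` and `truncation_retention` are hypothetical and do not reproduce the committed benchmark script; in particular, the use of sample standard deviation is an assumption.

```python
import random
import statistics


def make_events(seed: int, needles: int = 8, noise: int = 56) -> list[str]:
    """Shuffle needle and noise markers with a seeded RNG (toy stand-in)."""
    rng = random.Random(seed)
    events = ["needle"] * needles + ["noise"] * noise
    rng.shuffle(events)
    return events


def truncation_retention(events: list[str], budget: int = 16) -> float:
    """Fraction of all needles that survive keeping only the last `budget` events."""
    total = events.count("needle")
    kept = events[-budget:].count("needle")
    return kept / total


# One retention value per seed, then mean and sample standard deviation.
runs = [truncation_retention(make_events(seed)) for seed in range(10)]
mean = statistics.mean(runs)
std = statistics.stdev(runs)
```

With needles placed uniformly at random, truncation is expected to keep 16/64 = 25% of them, which is a useful sanity check on any reported truncation baseline.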
Replay determinism
All strategies returned byte-identical working sets across 3 replays per seed (100%).
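One way to check a byte-identical claim like this is to hash a canonical serialization of each working set and compare digests across replays. The sketch below assumes events are JSON-serializable dicts; `fingerprint` and `is_replay_consistent` are hypothetical names, not part of the observatory package.

```python
import hashlib
import json


def fingerprint(working_set: list[dict]) -> str:
    """Hash a canonical JSON encoding of the working set."""
    payload = json.dumps(working_set, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_replay_consistent(build_view, replays: int = 3) -> bool:
    """Call `build_view` repeatedly and check that every result hashes identically."""
    digests = {fingerprint(build_view()) for _ in range(replays)}
    return len(digests) == 1


# Toy log and a truncation "strategy" that keeps the last 16 events.
events = [{"id": i, "role": "user", "content": f"event {i}"} for i in range(64)]
consistent = is_replay_consistent(lambda: events[-16:])
```

Sorting keys before hashing keeps the check from failing on dict-ordering differences that do not change the content.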
What "underperforms baseline" means
recency+role uses default role weights that do not know about needles, so it actively deprioritizes the marker role — a real failure mode of imperfect importance signals.
Reproduce
PYTHONPATH=observatory/src \
python -m eval.benchmarks.needle_retention \
--seeds 10
Run timestamp: 2026-05-10T13:07:33Z · seeds: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
- H1 vindicated. The needle_aware (oracle) scorer returns 100.0% ± 0.0% retention across all 10 seeds.
- H2 partially vindicated. A query-anchored task_relevance scorer that does not know about the privileged "needle" role still achieves 65.0% ± 15.4% retention, 2.4× the naive-truncation baseline (27.5%).
- H3 vindicated. All five strategies returned 100% replay-consistent working sets across three replays per seed.
- An unexpected finding. recency+role (21.2%) underperforms naive truncation (27.5%). The default role-weight table does not include the "needle" role, so the composite scorer actively deprioritizes the marker. The architecture surfaces it cleanly.
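The failure mode is easy to reproduce in miniature. In the sketch below, the weight values, the fallback for unknown roles, and the multiplicative combination are all invented for illustration; the actual default role-weight table is not shown on this page.

```python
# Hypothetical role weights; note that "needle" is absent from the table.
DEFAULT_ROLE_WEIGHTS = {"user": 1.0, "assistant": 0.8, "tool": 0.6}
UNKNOWN_ROLE_WEIGHT = 0.1  # hypothetical fallback for roles missing from the table


def recency_role_score(index: int, role: str, total: int) -> float:
    """Composite score: linear recency in (0, 1] times a role weight."""
    recency = (index + 1) / total
    return recency * DEFAULT_ROLE_WEIGHTS.get(role, UNKNOWN_ROLE_WEIGHT)


def keep_top(roles: list[str], budget: int, scorer) -> list[int]:
    """Indices of the `budget` best-scoring events, in log order."""
    total = len(roles)
    ranked = sorted(range(total), key=lambda i: (-scorer(i, roles[i], total), i))
    return sorted(ranked[:budget])


roles = ["user", "assistant", "tool", "user", "needle", "assistant"]
truncated = list(range(len(roles)))[-3:]             # naive truncation: [3, 4, 5]
composite = keep_top(roles, 3, recency_role_score)   # composite scorer: [2, 3, 5]
```

Truncation keeps the needle at index 4 simply because it is recent, while the composite scorer drops it because the unknown role drags its score below an older tool event.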
Importance scoring is cheap.
Importance scoring adds < 1 ms per view at 50 events. The simpler accuracy/latency comparison from the original eval:
Live observatory eval
Importance-weighted view vs naive truncation
50 events, window 16. Output of python -m observatory.eval compare. Accuracy figures are placeholders pending v0.2 RULER integration; latency is real.
The latency cost is Carmack's missing number. Both view-construction calls operate over the same immutable log; the hygiene path runs the importance scorer once per event. In real-world use, cost will be dominated by model inference, not by view construction.
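A rough way to measure view-construction latency yourself is to time a top-k selection over precomputed scores with `time.perf_counter`. The sketch below uses a toy `top_k_view` and an arbitrary half-life of 8 events, both assumptions; it does not call the observatory package and its timings will not match the published ones.

```python
import time


def top_k_view(scores: list[float], window: int) -> list[int]:
    """Toy view construction: indices of the `window` highest scores, in log order."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:window])


def median_latency_ms(fn, repeats: int = 200) -> float:
    """Median wall-clock time of `fn()` in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]


# 50 events and window 16, matching the eval's shape; scores are recency weights.
scores = [0.5 ** ((49 - i) / 8.0) for i in range(50)]
latency = median_latency_ms(lambda: top_k_view(scores, window=16))
```

Reporting the median rather than the mean keeps one-off scheduler hiccups from inflating the number.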
v0.1 surface area.
- EventLog: immutable append-only event log with stable IDs and timestamps
- view(): pure function over (log, scorer, window)
- importance.recency_attention(): composable recency + attention-norm scorer
- importance.task_relevance(query=...): query-anchored scorer
- confidence.dissociation(): separate retrieval / generation confidence tracking
- eval.ruler_extended(): RULER tasks instrumented with hygiene-aware metrics
pip install claim-observatory # PyPI distribution name
# imports as:
from observatory import EventLog, view, importance, confidence
# or from a clone:
git clone https://github.com/abdul-abdi/ai-brain-claims
cd ai-brain-claims/observatory
pip install -e .
pytest # 22 passing tests
MIT. Issues and PRs welcome: see the reproduce instructions and the repo.