Wave 3 · The Observatory

The benchmark, with handles.

We built one primitive — an importance scorer with an exponential recency half-life — and measured it on a needle-in-haystack task across windows, needle counts, and seeds. The verdict on Claim 10 is a function of these knobs. Move them; watch it move with you.
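As a rough illustration of what "exponential recency half-life" means, the sketch below weights an event by how many tokens old it is, halving the weight every `half_life` tokens. The function names and the token-position representation are assumptions made for this example, not the library's actual API.

```python
def recency_weight(age_tokens: float, half_life: float) -> float:
    """Return a weight in (0, 1] that halves every `half_life` tokens of age."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (age_tokens / half_life)


def score_events(positions, now, half_life):
    """Score events by token position; `now` is the current token position."""
    return [recency_weight(now - p, half_life) for p in positions]


# An event exactly one half-life old gets half the weight of a fresh one.
weights = score_events([0, 1000, 2000], now=2000, half_life=1000)
```

A longer half-life flattens the curve, so older events keep more of their weight; a shorter one approaches plain truncation.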

[Figure 1 plot: retention (y-axis, 0.2 to 1.0) against half-life in tokens (x-axis, 1k to 16k). Legend: seed 3, mean (8 seeds), min/max band.]

Figure 1. Retention as a function of recency half-life, holding window=32k and needles=8. Orange is the highlighted seed; gray is the per-seed cloud; the dark line is the mean. The variance band is wider than the mean's curvature — which is the whole point of the verdict.


[Interactive heatmap: per-seed retention at the current configuration; color encodes mean retention.]

Architecture

Why an immutable event log.

The proposed two-tier mutable storage architecture is structurally wrong. Tier promotion is a point of ruin: if your importance scorer makes a wrong promotion at step 47 of a 200-step run, that error is baked into "consolidated" state and propagates forward.

The append-only log doesn't have this property. Wrong signals are isolated events, not architectural state. Pure scoring functions over the log mean different scorers can be A/B'd against the same historical events — deterministically replayable, debuggable, and amenable to retrospective analysis after new evals or new models ship.
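To make the "pure scoring functions over the log" idea concrete, here is a minimal, self-contained sketch. It is not the observatory implementation; `AppendOnlyLog`, `derive_view`, and both scorers are hypothetical names invented for illustration. The point it tries to show is that two scorers can be compared against the same events without either one mutating them.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Event:
    index: int
    role: str
    content: str


@dataclass
class AppendOnlyLog:
    _events: list = field(default_factory=list)

    def append(self, role: str, content: str) -> Event:
        # Events are only ever added; nothing is promoted, edited, or deleted.
        event = Event(len(self._events), role, content)
        self._events.append(event)
        return event

    def events(self) -> tuple:
        return tuple(self._events)


Scorer = Callable[[Event, int], float]


def derive_view(log: AppendOnlyLog, scorer: Scorer, window: int) -> tuple:
    """Pure function: keep the `window` best-scoring events, in log order."""
    events = log.events()
    n = len(events)
    # Ties are broken by index so the result is deterministic.
    ranked = sorted(events, key=lambda e: (-scorer(e, n), e.index))
    return tuple(sorted(ranked[:window], key=lambda e: e.index))


def by_recency(event: Event, n: int) -> float:
    return float(event.index)


def by_keyword(event: Event, n: int) -> float:
    return 1.0 if "deadline" in event.content else 0.0


log = AppendOnlyLog()
log.append("user", "the deadline is Friday")
log.append("assistant", "noted")
log.append("tool", "search results")
log.append("user", "what else?")

recent = derive_view(log, by_recency, window=2)
keyword = derive_view(log, by_keyword, window=2)
```

Because `derive_view` reads the log and returns a new tuple, a bad scorer produces a bad view, not bad stored state; swapping the scorer and re-deriving is enough to recover.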

The "git for agent context" framing is the wrong metaphor. Git models diff intention; an agent log is a fact stream. The mental model is closer to Datomic than to git.

from observatory import EventLog, view, importance, confidence
# Assumption: compare is exported at the top level; the snippet calls it below.
from observatory import compare

log = EventLog()
log.append({"role": "user", "content": "..."})
log.append({"role": "assistant", "content": "..."})
log.append({"role": "tool", "name": "search", "content": "..."})

# Working memory is a derived view, not stored state
working = view(
    log,
    scorer=importance.recency_attention(),
    window=4096,
)

# A confidence signal is recorded as one more event, never as a mutation
sig = confidence.dissociation(retrieval_score=0.92, generation_score=0.41)
if sig.diverged():
    log.append({"role": "verifier", "action": "re-query", "trigger": sig})

# A/B a second scorer against the same log, then diff the two views
alt = view(log, scorer=importance.task_relevance(query="..."), window=4096)
delta = compare(working, alt)

Empirical proof — needle-retention

What the numbers say.

The figure below is real. It is the output of python -m eval.benchmarks.needle_retention at the default config — 8 needles randomly placed among 56 noise events, working window 16, 10 seeds. The script is committed at eval/benchmarks/needle_retention.py; the JSON is consumed at build time.
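For readers who want the shape of the task before opening the script, the following is a simplified sketch of the setup described above, not the committed eval/benchmarks/needle_retention.py. The helper names and the boolean-list episode encoding are assumptions made for this example; only the truncation strategy is shown.

```python
import random


def make_episode(seed: int, needles: int = 8, noise: int = 56) -> list:
    """Return a list of booleans, True where an event is a needle."""
    rng = random.Random(seed)  # seeded, so each episode is reproducible
    total = needles + noise
    needle_positions = set(rng.sample(range(total), needles))
    return [i in needle_positions for i in range(total)]


def truncation(episode: list, budget: int) -> list:
    """Keep the indices of the last `budget` events."""
    start = max(0, len(episode) - budget)
    return list(range(start, len(episode)))


def retention(episode: list, kept: list) -> float:
    """Fraction of all needles that survive in the kept set."""
    total_needles = sum(episode)
    return sum(1 for i in kept if episode[i]) / total_needles


scores = [
    retention(episode, truncation(episode, budget=16))
    for episode in (make_episode(seed) for seed in range(10))
]
```

With needles placed uniformly at random, any strategy that keeps a fixed 16 of the 64 positions retains 16/64 = 25% of needles in expectation, which is a useful sanity check when reading the truncation and recency figures.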

Needle-Retention Benchmark · 10 seeds · ach-benchmark-1

Working-set retention by strategy

Real run of python -m eval.benchmarks.needle_retention. 8 needles randomly placed among 56 noise events; budget = 16 events. Higher is better. Bars show mean retention ± 1 std. Each dot is one seed run — click a seed to see how that single configuration played out across all strategies.

Strategy                Mean retention   Std      Note
truncation              27.5%            ±11.5%
recency                 27.5%            ±11.5%
recency+role            21.3%            ±11.9%   underperforms baseline
task_relevance          65.0%            ±15.4%
needle_aware (oracle)   100.0%           ±0.0%

Replay determinism

All strategies returned byte-identical working sets across 3 replays per seed (100%).
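One way a byte-identical check like this could be implemented is to serialize each working set canonically and compare digests across replays. The sketch below assumes working sets are JSON-serializable; `fingerprint` and `is_replay_deterministic` are hypothetical helpers, not functions from the benchmark.

```python
import hashlib
import json


def fingerprint(working_set) -> str:
    """Hash a canonical JSON encoding so equal sets give equal bytes."""
    payload = json.dumps(
        working_set, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_replay_deterministic(build_view, replays: int = 3) -> bool:
    """Rebuild the view `replays` times and require a single distinct digest."""
    digests = {fingerprint(build_view()) for _ in range(replays)}
    return len(digests) == 1


events = [{"i": i, "role": "user", "content": f"event {i}"} for i in range(5)]


def last_two():
    # A pure view: same log in, same working set out.
    return sorted(events, key=lambda e: e["i"])[-2:]


deterministic = is_replay_deterministic(last_two)
```

Sorting keys and fixing separators matters here: without a canonical encoding, two equal working sets could serialize to different bytes and produce a false alarm.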

What "underperforms baseline" means

recency+role uses default role weights that do not know about needles, so it actively deprioritizes the marker role — a real failure mode of imperfect importance signals.
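The failure mode can be reproduced in a few lines. The sketch below multiplies a recency weight by a per-role weight; the specific weights, the fallback for unknown roles, and the half-life of 8 events are all hypothetical values chosen for illustration, since the source does not list the real defaults.

```python
# Hypothetical defaults: the "marker" role is absent, so it falls back to
# UNKNOWN_ROLE_WEIGHT and is ranked below every known role.
DEFAULT_ROLE_WEIGHTS = {"user": 1.0, "assistant": 0.8, "tool": 0.6}
UNKNOWN_ROLE_WEIGHT = 0.5


def recency(age: float, half_life: float = 8.0) -> float:
    return 0.5 ** (age / half_life)


def recency_role(age: float, role: str, half_life: float = 8.0) -> float:
    weight = DEFAULT_ROLE_WEIGHTS.get(role, UNKNOWN_ROLE_WEIGHT)
    return recency(age, half_life) * weight


# A needle 2 events old versus a user message 6 events old.
marker_recency = recency(2)
user_recency = recency(6)
marker_combined = recency_role(2, "marker")
user_combined = recency_role(6, "user")
```

Under recency alone the newer marker outranks the older user message; once the role weight is applied the order flips, which is how a scorer with extra signal can end up below the plain recency baseline.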

Reproduce

PYTHONPATH=observatory/src \
  python -m eval.benchmarks.needle_retention \
    --seeds 10

Run timestamp: 2026-05-10T13:07:33Z · seeds: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


Latency · Carmack's missing number

Importance scoring is cheap.

Importance scoring adds < 1 ms per view at 50 events. The simpler accuracy/latency comparison from the original eval:

Live observatory eval

Importance-weighted view vs naive truncation

50 events, window 16. Output of python -m observatory.eval compare. Accuracy figures are placeholders pending v0.2 RULER integration; latency is real.

Metric        Baseline   Hygiene    Δ
Accuracy      62.0%      74.0%      +12.0%
p50 latency   0.13μs     0.130ms    +0.130ms

The latency cost is Carmack's missing number. Both view-construction calls operate over the same immutable log; the hygiene path runs the importance scorer once per event. In real-world use, cost will be dominated by model inference, not by view construction.
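A p50 figure of this kind can be measured with nothing beyond the standard library. The sketch below times a plain slice against a scored view over 50 synthetic events with a window of 16; it is a stand-in for illustration, not the `observatory.eval compare` command, and the scorer and repeat count are assumptions.

```python
import statistics
import time

events = list(range(50))  # synthetic event indices
WINDOW = 16


def p50_latency_ms(fn, repeats: int = 200) -> float:
    """Median wall-clock time of `fn` in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - start) / 1e6)
    return statistics.median(samples)


def truncate():
    # Baseline: keep the last WINDOW events.
    return events[-WINDOW:]


def scored_view():
    # Hygiene path: score every event once, keep the top WINDOW, restore order.
    n = len(events)
    ranked = sorted(events, key=lambda i: 0.5 ** ((n - 1 - i) / 8), reverse=True)
    return sorted(ranked[:WINDOW])


baseline_ms = p50_latency_ms(truncate)
hygiene_ms = p50_latency_ms(scored_view)
```

Absolute numbers will vary by machine and interpreter, so the useful comparison is the gap between the two medians relative to a single model call.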


Install + ship

v0.1 surface area.

pip install claim-observatory     # PyPI distribution name
# imports as:
from observatory import EventLog, view, importance, confidence

# or from a clone:
git clone https://github.com/abdul-abdi/ai-brain-claims
cd ai-brain-claims/observatory
pip install -e .
pytest                            # 22 passing tests

MIT licensed. Issues and PRs are welcome; see the reproduce instructions and the repo.