Wave 3 · The Observatory

The benchmark, with handles.

We built one primitive — an importance scorer with an exponential recency half-life — and measured it on a needle-in-haystack task across windows, needle counts, and seeds. The verdict on Claim 10 is a function of these knobs. Move them; watch it move with you.
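As a rough illustration of the primitive, here is a minimal sketch of an exponential recency half-life, assuming age is measured in events and the weight halves every `half_life` events. The names `recency_weight` and `score_events` are hypothetical and are not taken from the observatory package.

```python
def recency_weight(age: float, half_life: float) -> float:
    """Weight that halves every `half_life` units of age (hypothetical helper)."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (age / half_life)


def score_events(n_events: int, half_life: float) -> list[float]:
    """Score events 0..n-1, where the newest event (index n-1) has age 0."""
    newest = n_events - 1
    return [recency_weight(newest - i, half_life) for i in range(n_events)]


# Assumed toy configuration: five events, half-life of two events.
scores = score_events(n_events=5, half_life=2.0)
# The newest event scores 1.0; an event two steps older scores 0.5.
```

A shorter half-life makes the scorer behave more like plain truncation; a longer one flattens the weights so that other signals can matter more.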

Figure 1. Retention as a function of recency half-life, holding window=32k and needles=8. Orange is the highlighted seed; gray is the per-seed cloud; the dark line is the mean. The variance band is wider than the mean's curvature — which is the whole point of the verdict.

Figure: per-seed retention at the current configuration (color = mean retention).

Architecture

Why an immutable event log.

The proposed two-tier mutable storage architecture is structurally wrong. Tier promotion is a point of ruin: if your importance scorer makes a wrong promotion at step 47 of a 200-step run, that error is baked into "consolidated" state and propagates forward.

The append-only log doesn't have this property. Wrong signals are isolated events, not architectural state. Pure scoring functions over the log mean different scorers can be A/B'd against the same historical events — deterministically replayable, debuggable, and amenable to retrospective analysis after new evals or new models ship.
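The argument above can be sketched in a few lines without the library: an append-only log modeled as an immutable tuple, and a view derived by a pure function, so two scorers can be compared against exactly the same history. Everything here (`Event`, `append`, `derive_view`, and both scorers) is an illustrative assumption and not the observatory API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Event:
    id: int
    role: str
    content: str


# Hypothetical scorer signature: (event, total number of events) -> score.
Scorer = Callable[[Event, int], float]


def append(log: tuple[Event, ...], role: str, content: str) -> tuple[Event, ...]:
    """Return a new log with one more event; the old log is never mutated."""
    return log + (Event(id=len(log), role=role, content=content),)


def derive_view(log: tuple[Event, ...], scorer: Scorer, window: int) -> tuple[Event, ...]:
    """Pure function: keep the `window` top-scored events, returned in log order."""
    ranked = sorted(log, key=lambda e: (-scorer(e, len(log)), e.id))
    return tuple(sorted(ranked[:window], key=lambda e: e.id))


def by_recency(event: Event, n: int) -> float:
    return float(event.id)


def prefer_tools(event: Event, n: int) -> float:
    # Assumed toy rule: tool output gets a fixed bonus on top of recency.
    return (1.0 if event.role == "tool" else 0.0) + event.id / n


log: tuple[Event, ...] = ()
for role, content in [("user", "q"), ("tool", "result"),
                      ("assistant", "a1"), ("assistant", "a2")]:
    log = append(log, role, content)

view_a = derive_view(log, by_recency, window=2)    # events 2 and 3
view_b = derive_view(log, prefer_tools, window=2)  # events 1 and 3
```

Because neither call writes anything back, a bad scorer only produces a bad view; the log that both views were derived from is unchanged and can be rescored later.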

The "git for agent context" framing is the wrong metaphor. Git models diff intention; an agent log is a fact stream. The mental model is closer to Datomic than to git.

from observatory import EventLog, view, importance, confidence

log = EventLog()
log.append({"role": "user", "content": "..."})
log.append({"role": "assistant", "content": "..."})
log.append({"role": "tool", "name": "search", "content": "..."})

# Working memory is a derived view, not stored state
working = view(
    log,
    scorer=importance.recency_attention(),
    window=4096,
)

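# Track retrieval and generation confidence separately; a large gap triggers a re-check
sig = confidence.dissociation(retrieval_score=0.92, generation_score=0.41)
if sig.diverged():
    # The verifier action is itself just another appended event
    log.append({"role": "verifier", "action": "re-query", "trigger": sig})

# A/B a second scorer against the same log; no stored state is touched
alt = view(log, scorer=importance.task_relevance(query="..."), window=4096)
# NOTE: compare is not among the names imported at the top of this snippet
delta = compare(working, alt)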
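The snippet does not show what `diverged()` checks. One plausible reading, sketched below under stated assumptions, is that the signal fires when the gap between retrieval and generation confidence exceeds a cutoff. The `Dissociation` class and the 0.3 threshold are hypothetical and are not taken from the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dissociation:
    retrieval_score: float
    generation_score: float
    threshold: float = 0.3  # assumed cutoff, not the library's value

    @property
    def gap(self) -> float:
        """Signed difference: positive when retrieval is more confident."""
        return self.retrieval_score - self.generation_score

    def diverged(self) -> bool:
        """True when the two confidences disagree by more than the threshold."""
        return abs(self.gap) > self.threshold


# Same scores as the snippet above: a gap of 0.51 exceeds the assumed cutoff.
sig = Dissociation(retrieval_score=0.92, generation_score=0.41)
```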

Empirical proof — needle-retention

What the numbers say.

The figure below is real. It is the output of python -m eval.benchmarks.needle_retention at the default config — 8 needles randomly placed among 56 noise events, working window 16, 10 seeds. The script is committed at eval/benchmarks/needle_retention.py; the JSON is consumed at build time.
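To make the setup concrete, here is a minimal stand-in for the benchmark loop, assuming needles are placed uniformly at random in a 64-event stream and retention is the fraction of the 8 needles that survive into the 16-event working set. This is not the committed script; the function names and the shuffle-based placement are assumptions.

```python
import random

N_NEEDLES, N_NOISE, BUDGET = 8, 56, 16


def make_events(seed: int) -> list[str]:
    """Shuffle 8 needle and 56 noise events into one 64-event stream."""
    rng = random.Random(seed)
    events = ["needle"] * N_NEEDLES + ["noise"] * N_NOISE
    rng.shuffle(events)
    return events


def truncation(events: list[str], budget: int) -> list[str]:
    """Keep only the most recent `budget` events."""
    return events[-budget:]


def oracle(events: list[str], budget: int) -> list[str]:
    """Keep every needle first, then fill the rest of the budget with recent noise."""
    needles = [e for e in events if e == "needle"]
    noise = [e for e in events if e != "needle"]
    fill = max(budget - len(needles), 0)
    return (needles + noise[len(noise) - fill:])[:budget]


def retention(working: list[str]) -> float:
    """Fraction of all needles present in the working set."""
    return working.count("needle") / N_NEEDLES


def mean_retention(strategy, seeds) -> float:
    runs = [retention(strategy(make_events(s), BUDGET)) for s in seeds]
    return sum(runs) / len(runs)


seeds = range(10)
trunc_mean = mean_retention(truncation, seeds)
oracle_mean = mean_retention(oracle, seeds)
# Under uniform placement each needle lands in the last 16 of 64 slots
# with probability 16/64, so truncation retains 25% in expectation.
expected_truncation = BUDGET / (N_NEEDLES + N_NOISE)
```

The 25% expectation gives a sanity check for the reported truncation result: 27.5% ± 11.5% over 10 seeds is consistent with it.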

Needle-Retention Benchmark · 10 seeds · ach-benchmark-1

Working-set retention by strategy

Real run of python -m eval.benchmarks.needle_retention. 8 needles randomly placed among 56 noise events; budget = 16 events. Higher is better. Bars show mean retention ± 1 std. Each dot is one seed run.

strategy                mean retention   std       note
truncation              27.5%            ±11.5%
recency                 27.5%            ±11.5%
recency+role            21.3%            ±11.9%    underperforms baseline
task_relevance          65.0%            ±15.4%
needle_aware (oracle)   100.0%           ±0.0%

Replay determinism

All strategies returned byte-identical working sets across 3 replays per seed (100%).
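One way to check a byte-identical claim like this is to hash a canonical serialization of each replay's working set and compare digests. The sketch below assumes events are JSON-serializable dicts; `working_set_digest` and `is_replay_deterministic` are hypothetical helpers, not part of the benchmark script.

```python
import hashlib
import json


def working_set_digest(working_set: list[dict]) -> str:
    """SHA-256 over a canonical JSON encoding of the working set."""
    payload = json.dumps(working_set, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_replay_deterministic(build_view, replays: int = 3) -> bool:
    """Rebuild the view `replays` times and check that all digests match."""
    digests = {working_set_digest(build_view()) for _ in range(replays)}
    return len(digests) == 1


# Toy stand-in for a strategy: truncate a fixed 64-event stream to 16 events.
events = [{"id": i, "role": "noise"} for i in range(64)]
stable = is_replay_deterministic(lambda: events[-16:])
```

Sorting keys before hashing means that dict ordering alone cannot cause a false mismatch.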

What "underperforms baseline" means

recency+role uses default role weights that do not know about needles, so it actively deprioritizes the marker role — a real failure mode of imperfect importance signals.
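The failure mode can be reproduced in miniature. The sketch below assumes a composite score of recency multiplied by a role weight, with a low fallback for roles missing from the table. The weight values, the fallback, and the half-life are all hypothetical; only the shape of the problem is taken from the text.

```python
ROLE_WEIGHTS = {"user": 1.0, "assistant": 0.8, "tool": 0.6}  # hypothetical values
DEFAULT_WEIGHT = 0.1  # hypothetical fallback for roles missing from the table


def recency_only_score(index: int, n_events: int, half_life: float = 8.0) -> float:
    """Exponential recency weight; the newest event scores 1.0."""
    return 0.5 ** ((n_events - 1 - index) / half_life)


def recency_role_score(index: int, n_events: int, role: str,
                       half_life: float = 8.0) -> float:
    """Recency scaled by a per-role weight, with a fallback for unknown roles."""
    weight = ROLE_WEIGHTS.get(role, DEFAULT_WEIGHT)
    return recency_only_score(index, n_events, half_life) * weight


n = 64
# A recent needle (3 events old) against an older tool event (13 events old).
needle = recency_role_score(index=60, n_events=n, role="needle")
older_tool = recency_role_score(index=50, n_events=n, role="tool")
# On recency alone the needle wins; with the role table it loses.
needle_recency = recency_only_score(index=60, n_events=n)
tool_recency = recency_only_score(index=50, n_events=n)
```

Under these assumed weights the unknown role's fallback outweighs its recency advantage, which is the kind of effect that would push a composite scorer below plain truncation.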

Reproduce

PYTHONPATH=observatory/src \
  python -m eval.benchmarks.needle_retention \
    --seeds 10

Run timestamp: 2026-05-10T13:07:33Z · seeds: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

H1 vindicated. The needle_aware (oracle) scorer returns 100.0% ± 0.0% retention across all 10 seeds.
H2 partially vindicated. A query-anchored task_relevance scorer that does not know about the privileged "needle" role still achieves 65.0% ± 15.4% retention, about 2.4× the truncation baseline (27.5%).
H3 vindicated. All five strategies returned 100% replay-consistent working sets across three replays per seed.
An unexpected finding. recency+role (21.3%) underperforms naive truncation (27.5%). The default role-weight table does not include the "needle" role, so the composite scorer actively deprioritizes the marker. The architecture surfaces this failure cleanly.

Latency · Carmack's missing number

Importance scoring is cheap.

Importance scoring adds < 1 ms per view at 50 events. The simpler accuracy/latency comparison from the original eval:

Live observatory eval

Importance-weighted view vs naive truncation

50 events, window 16. Output of python -m observatory.eval compare. Accuracy figures are placeholders pending v0.2 RULER integration; latency is real.

metric         baseline    hygiene     Δ
accuracy       62.0%       74.0%       +12.0%
p50 latency    0.13 μs     0.130 ms    +0.130 ms

The latency cost is Carmack's missing number. Both view-construction calls operate over the same immutable log; the hygiene path runs the importance scorer once per event. In real-world use, cost will be dominated by model inference, not by view construction.
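A p50 figure of this kind can be measured with a simple timing harness. The sketch below is an assumed approach, not the code behind python -m observatory.eval compare: it times a plain slice against a view that scores every event once, over 50 events with a window of 16. The half-life of 8 events is arbitrary.

```python
import statistics
import time


def p50_latency_ms(fn, repeats: int = 200) -> float:
    """Median wall-clock time of `fn` in milliseconds over `repeats` calls."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


events = list(range(50))  # event ids; higher id means more recent
WINDOW = 16


def truncate() -> list[int]:
    """Baseline: keep the last WINDOW events."""
    return events[-WINDOW:]


def scored_view() -> list[int]:
    """Score each event once by recency, keep the top WINDOW, return in log order."""
    n = len(events)
    ranked = sorted(events, key=lambda i: 0.5 ** ((n - 1 - i) / 8.0), reverse=True)
    return sorted(ranked[:WINDOW])


baseline_ms = p50_latency_ms(truncate)
hygiene_ms = p50_latency_ms(scored_view)
```

Absolute numbers depend on the machine and the scorer, so the ratio between the two paths is more portable than either value alone.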

Install + ship

v0.1 surface area.

EventLog — immutable append-only event log with stable IDs and timestamps
view() — pure function over (log, scorer, window)
importance.recency_attention() — composable recency + attention-norm scorer
importance.task_relevance(query=...) — query-anchored scorer
confidence.dissociation() — separate retrieval / generation confidence tracking
eval.ruler_extended() — RULER tasks instrumented with hygiene-aware metrics

pip install claim-observatory     # PyPI distribution name
# imports as:
from observatory import EventLog, view, importance, confidence

# or from a clone:
git clone https://github.com/abdul-abdi/ai-brain-claims
cd ai-brain-claims/observatory
pip install -e .
pytest                            # 22 passing tests

MIT licensed. Issues and PRs welcome; see the reproduce instructions above and the repo.