Reproducibility

Reproduce every number on this site.

Each quantitative claim has a single command behind it and a deterministic seed. If you run the commands below and the numbers don't match the figures, file an issue.

Setup

git clone https://github.com/abdul-abdi/ai-brain-claims
cd ai-brain-claims/observatory
pip install -e ".[dev]"
pytest -q

Expected: 22 passed.

The needle-retention benchmark

The figure on the observatory page is generated by:

cd ..
PYTHONPATH=observatory/src \
    python -m eval.benchmarks.needle_retention --seeds 10

Expected output (byte-identical across machines running the same Python version, because random.Random(seed) is deterministic for a given seed):

Needle-retention benchmark — 10 seeds, window=16, needles=8/64

  strategy                      mean    ± std   replay
  -------------------------  -------  -------  -------
  truncation                   27.5%  ± 11.5%     100%
  recency                      27.5%  ± 11.5%     100%
  recency+role                 21.2%  ± 11.9%     100%
  task_relevance               65.0%  ± 15.4%     100%
  needle_aware (oracle)       100.0%  ±  0.0%     100%

Latency numbers will vary by machine; the relative ordering is stable. Retention numbers (the mean and ± std columns) will not vary. Replay determinism (the replay column) should always be 100%.
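To make the determinism claim concrete, here is a minimal sketch of a seeded retention measurement. It is not the repo's implementation: the function names and the "keep the last window items" rule are assumptions chosen for illustration, with parameters borrowed from the header line above (window=16, needles=8/64). The point is that a private random.Random(seed) instance gives the same needle placement on every run, so two runs over the same seeds agree exactly.

```python
import random
import statistics


def recency_retention(seed: int, total: int = 64, needles: int = 8,
                      window: int = 16) -> float:
    """Fraction of needles kept when only the last `window` items survive.

    Hypothetical stand-in for a benchmark strategy, not the repo's code.
    """
    # A private generator: unaffected by global random state or other callers.
    rng = random.Random(seed)
    needle_positions = rng.sample(range(total), needles)
    kept = sum(1 for pos in needle_positions if pos >= total - window)
    return kept / needles


def run(seeds: int = 10) -> list[float]:
    """Measure retention once per seed, using seeds 0..seeds-1."""
    return [recency_retention(seed) for seed in range(seeds)]


first = run()
second = run()

# Replay determinism: the same seeds must reproduce the same results.
assert first == second

print(f"mean={statistics.mean(first):.1%}  std={statistics.stdev(first):.1%}")
```

The sketch compares two runs in the same process. Comparing against numbers produced on another machine additionally assumes the same Python version, since helpers such as sample are not guaranteed to produce the same sequence across versions.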

The accuracy / latency comparison

The "Importance-weighted view vs naive truncation" figure:

PYTHONPATH=observatory/src \
    python -m observatory.eval compare --steps 50 --window 16

Refresh the figures on the site

cp eval/results/needle-retention.json site/src/data/needle-retention.json
cp eval/results/baseline-vs-hygiene.json site/src/data/baseline-vs-hygiene.json
cd site && npm run build

The figures rebuild from the JSON. Push to main and CI redeploys to GitHub Pages within ~50 seconds.
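Before building, it can help to confirm that the copies under site/src/data actually match the freshly generated results. The following is a small sketch of such a check, assuming it is run from the repo root; the helper names sha256_of and stale_copies are hypothetical and not part of the repo, while the file paths are the ones from the cp commands above.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stale_copies(pairs):
    """Return the (source, copy) pairs where either file is missing
    or the copy's contents differ from the source."""
    stale = []
    for src, dst in pairs:
        if not (src.exists() and dst.exists()) or sha256_of(src) != sha256_of(dst):
            stale.append((src, dst))
    return stale


# Source and destination paths taken from the cp commands.
PAIRS = [
    (Path("eval/results/needle-retention.json"),
     Path("site/src/data/needle-retention.json")),
    (Path("eval/results/baseline-vs-hygiene.json"),
     Path("site/src/data/baseline-vs-hygiene.json")),
]

if __name__ == "__main__":
    for src, dst in stale_copies(PAIRS):
        print(f"stale: {dst} does not match {src}")
```

An empty result means both site copies are byte-identical to the generated results, so the rebuilt figures reflect the latest run.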

Verify the citations

The 25-paper reading list shows a verification status for each item. Each verified entry was re-fetched from arXiv on 2026-05-10, and its title and authors were confirmed against the site source. Entries marked verified-with-correction had the wrong attribution in an earlier draft and have been fixed in place.

Get the raw data — the data/ archive

The data/ directory in the repo holds the prompts, transcripts, verification outputs, and pipeline manifest used to produce everything on this site. The file data/index.md maps every quantitative claim on the site to a file in that directory.

The standard the repo holds itself to: every number on the site should trace to a file in this archive. If you find one that doesn't, file an issue.