Reproduce every number on this site.
Each quantitative claim has a single command behind it and a deterministic seed. If you run the commands below and the numbers don't match the figures, file an issue.
Setup
git clone https://github.com/abdul-abdi/ai-brain-claims
cd ai-brain-claims/observatory
pip install -e ".[dev]"
pytest -q
Expected: 22 passed.
The needle-retention benchmark
The figure on the observatory page is generated by:
cd ..
PYTHONPATH=observatory/src \
python -m eval.benchmarks.needle_retention --seeds 10
Expected output (byte-identical across machines, because Python's random.Random(seed) is deterministic):
Needle-retention benchmark — 10 seeds, window=16, needles=8/64
strategy mean ± std replay
------------------------- ------- ------- -------
truncation 27.5% ± 11.5% 100%
recency 27.5% ± 11.5% 100%
recency+role 21.2% ± 11.9% 100%
task_relevance 65.0% ± 15.4% 100%
needle_aware (oracle) 100.0% ± 0.0% 100%
Latency numbers will vary by machine; the relative ordering is stable. Retention numbers will not vary. Replay-determinism should always be 100%.
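To illustrate why the retention and replay columns should not move between runs, here is a minimal sketch of seeded sampling. The helper names `pick_needles` and `replay_rate` are hypothetical and are not taken from the benchmark's source; only the 8-of-64 needle count and the 10 seeds come from the output header above.

```python
import random


def pick_needles(seed: int, total: int = 64, needles: int = 8) -> list[int]:
    """Pick needle positions with a private, seeded generator (hypothetical helper)."""
    rng = random.Random(seed)  # independent of the global random state
    return sorted(rng.sample(range(total), needles))


def replay_rate(seeds) -> float:
    """Fraction of seeds for which two independent runs pick the same needles."""
    seeds = list(seeds)
    matches = sum(pick_needles(s) == pick_needles(s) for s in seeds)
    return matches / len(seeds)


print(f"{replay_rate(range(10)):.0%}")  # prints 100%
```

Because each call builds its own `random.Random(seed)`, nothing else in the process can perturb the sequence, which is the property a replay check relies on.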
The accuracy / latency comparison
The "Importance-weighted view vs naive truncation" figure:
PYTHONPATH=observatory/src \
python -m observatory.eval compare --steps 50 --window 16
Refresh the figure on the site
cp eval/results/needle-retention.json site/src/data/needle-retention.json
cp eval/results/baseline-vs-hygiene.json site/src/data/baseline-vs-hygiene.json
cd site && npm run build
The figures rebuild from the JSON. Push to main and CI redeploys to GitHub Pages within ~50 seconds.
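Before pushing, it may be worth confirming that the copies under site/src/data really match the fresh results. The sketch below assumes it is run from the repository root; the function name `stale_copies` is hypothetical, and the file paths are the ones from the cp commands above.

```python
import filecmp
from pathlib import Path

# (eval output, site copy) pairs, taken from the cp commands in this section.
PAIRS = [
    ("eval/results/needle-retention.json", "site/src/data/needle-retention.json"),
    ("eval/results/baseline-vs-hygiene.json", "site/src/data/baseline-vs-hygiene.json"),
]


def stale_copies(repo_root: Path, pairs=PAIRS) -> list[str]:
    """Return site data files that are missing or differ from the eval output."""
    stale = []
    for source, target in pairs:
        src, dst = repo_root / source, repo_root / target
        # shallow=False compares file contents, not just size and mtime.
        in_sync = src.is_file() and dst.is_file() and filecmp.cmp(src, dst, shallow=False)
        if not in_sync:
            stale.append(target)
    return stale


print(stale_copies(Path(".")))  # an empty list means the site data is current
```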
Verify the citations
The 25-paper reading list shows a verification status for each item. Each verified entry was re-fetched from arXiv on 2026-05-10, and its title and authors were confirmed against the site source. Entries marked verified-with-correction had the wrong attribution in an earlier draft and have been fixed in place.
Get the raw data — the data/ archive
The data/ directory in the repo holds the prompts, transcripts, verification outputs, and pipeline manifest used to produce everything on this site. data/index.md maps every quantitative claim on the site to a file in the archive.
- data/manifests/pipeline.json — every agent dispatch with model, wave, persona-loading status
- data/prompts/ — every prompt template used (claim-research, idea-research, roundtable R1+R2, angle-research, citation-verification, claim-port, design-pass, super-prompt-redesign)
- data/prompts/persona-lens-primers.md — the 50–80 word primers used by the 10 claim-research agents (NOT the full persona skills)
- data/transcripts/ — full roundtable transcript, idea doc, A/B/C angle dossiers
- data/verifications/citation-pass-1.json — 14-paper citation verification
- eval/results/needle-retention.json — full benchmark output
- eval/results/baseline-vs-hygiene.json — accuracy/latency JSON
- site/src/content/claims/ — all 10 claim MDX dossiers
The standard the repo holds itself to: every number on the site should trace to a file in this archive. If you find one that doesn't, file an issue.
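A quick way to check that a clone actually contains the archive is to test each listed path. This is a minimal sketch, assuming it runs from the repository root; `missing_paths` is a hypothetical helper, and the path list is copied from the items above.

```python
from pathlib import Path

# Paths named in the archive list above (files and directories).
ARCHIVE_PATHS = [
    "data/index.md",
    "data/manifests/pipeline.json",
    "data/prompts",
    "data/prompts/persona-lens-primers.md",
    "data/transcripts",
    "data/verifications/citation-pass-1.json",
    "eval/results/needle-retention.json",
    "eval/results/baseline-vs-hygiene.json",
    "site/src/content/claims",
]


def missing_paths(repo_root: Path, paths=ARCHIVE_PATHS) -> list[str]:
    """Return the archive paths that do not exist under repo_root."""
    return [p for p in paths if not (repo_root / p).exists()]


for path in missing_paths(Path(".")):
    print(f"missing: {path}")
```

An empty result only shows that the files are present; whether each number on the site maps to one of them is still recorded in data/index.md.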