What the numbers were hiding.
The task_relevance scorer achieves 65.0% ± 15.4% retention. That sentence is true. It is also close to useless as a representation of what the data contains.
The ± number tells you the spread is large. It doesn't show you that on seed 4, task_relevance reached 87.5% while recency+role, the scorer that actively undermines itself, bottomed out at 12.5%. A 7× divergence (87.5 / 12.5). On the same stochastic configuration. Both of those facts were sitting in retentions_per_seed the entire time, invisible.
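The difference between the two readings can be sketched in a few lines of Python. This is a minimal sketch, not the site's code. It assumes retentions_per_seed maps each strategy name to a list of retention fractions indexed by seed, and it assumes the ± figure is a sample standard deviation; the source does not say which estimator was used. The function names are hypothetical.

```python
from statistics import mean, stdev


def summarize(retentions):
    """Collapse per-seed retention fractions into (mean, spread).

    Assumption: spread is the sample standard deviation; the benchmark
    may use the population form instead.
    """
    return mean(retentions), stdev(retentions)


def divergence_at_seed(retentions_per_seed, seed, high, low):
    """Ratio of two strategies' retention on a single seed.

    Assumes a mapping of strategy name -> list of fractions indexed by
    seed. Returns None when the lower value is zero.
    """
    top = retentions_per_seed[high][seed]
    bottom = retentions_per_seed[low][seed]
    return top / bottom if bottom else None


# The one instance quoted in the text: seed 4, 87.5% against 12.5%.
print(0.875 / 0.125)  # 7.0
```

summarize is the representation the reader usually gets; divergence_at_seed is the question the summary cannot answer, because it needs two strategies on the same run.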
This is the diagnostic I apply to every piece of information shown as a static number: what is the reader being asked to feel, and can they feel it? A percentage with a standard deviation is an instruction to imagine a distribution. Nobody actually imagines a distribution. They read the number and move on.
Below is the same four-row slice of the benchmark data — just task_relevance, recency+role, truncation, and the oracle — rendered so that the seed structure is directly manipulable. Click a seed. Watch the values update across strategies. That 7× divergence is something you can find for yourself.
Live demonstration · 4 strategies · 10 seeds
The same measured data, two representations
Aggregate: task_relevance = 65.0% ± 15.4%. Select a seed to see what that spread actually contains.
Data: needle-retention.json · 8 needles / 56 noise events / window 16 · 10 seeds · nothing fabricated
Three moves
Three things were added to this site. None of them touches the typography, the palette, or the reading width. They enrich what was there without replacing it.
1. The seed scrubber on the benchmark figure
The observatory benchmark figure now has a row of seed buttons above the bar chart. Click seed 4. The bars fade to show aggregate context while an orange tick appears at each strategy's actual value for that run.
This is the ladder of abstraction in a small space. The aggregate (65.0% ± 15.4%) summarizes; the instance (seed 4: 87.5%) is concrete. Most tools trap you at one level. The scrubber lets you move.
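The scrubber's data logic is small enough to sketch. This is a hedged illustration, not the figure's actual implementation: it assumes the same strategy-to-list shape for retentions_per_seed, and the name seed_view and the row keys are hypothetical. The point it shows is that selecting a seed adds the instance to each row without discarding the aggregate.

```python
from statistics import mean


def seed_view(retentions_per_seed, seed=None):
    """Return one row per strategy for the bar chart.

    With seed=None each row carries only the aggregate. With a seed
    selected, each row also carries that run's value (the tick), and
    the aggregate stays in the row as context.
    """
    rows = []
    for strategy, values in retentions_per_seed.items():
        row = {
            "strategy": strategy,
            "aggregate": mean(values),
            "instance": None,
        }
        if seed is not None:
            row["instance"] = values[seed]
        rows.append(row)
    return rows
```

Keeping both values in one row is the design choice that matters: the renderer can fade the bar and draw the tick from the same record, so moving between levels never requires a second data source.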
2. The thread × verdict matrix
The home page asserts that every claim lands in CONTESTED or SPLIT. A sentence can assert that; a matrix makes it undeniable. The convergence pattern is the finding, and the matrix's visual structure makes that convergence palpable, not just legible.
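A matrix of this kind is a cross-tabulation, sketched below under stated assumptions. The input shape (a list of (thread, verdict) pairs) and the function names are hypothetical, and the full set of verdict columns is passed in by the caller because the source names only CONTESTED and SPLIT.

```python
def verdict_matrix(claims, verdicts):
    """Cross-tabulate (thread, verdict) pairs into per-thread counts.

    `verdicts` lists every column to show, including ones no claim
    lands in. A verdict missing from that list raises KeyError.
    """
    threads = sorted({thread for thread, _ in claims})
    counts = {t: {v: 0 for v in verdicts} for t in threads}
    for thread, verdict in claims:
        counts[thread][verdict] += 1
    return counts


def empty_columns(counts, verdicts):
    """Verdict columns that no thread contributes to."""
    return [
        v for v in verdicts
        if all(row[v] == 0 for row in counts.values())
    ]
```

Listing the empty columns explicitly is what turns the assertion into something checkable: if every claim lands in CONTESTED or SPLIT, every other column comes back from empty_columns.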
3. This page
A design-notes page that contained only prose would be a failure of its own argument. The demo above this text is the same seed-scrubber mechanism applied to a four-strategy subset of the real benchmark data. It is not illustrative; it is operational.
What was not done
No fabricated data. No 3D. No scroll-jacking. No motion that doesn't carry information. The tools are spare. The capability aspires to be profound: immediacy, manipulability, mutual visibility of reader and work.
— Applied by a Bret Victor persona agent, May 2026.