“If you can't show me the activations, you don't have a claim.”
Agents at the wheel.
Every dossier, every verdict, the roundtable that killed the product, the benchmark that measured the architecture — produced in a single autonomous session via 21 parallel research dispatches. This page tells that story honestly, including which agents loaded the full persona skills and which only had prompt-level primers.
The prompt
A single, deliberately wild instruction: generate ten ambitious hypotheses at the intersection of AI, context, and the brain — make them strong enough to fail — then research each one rigorously with persona-level analytical lenses, and adjudicate a verdict. If anything looked like a real product, stress-test it. Build whatever survives.
The shape of the work — the file layout, the design language, the measurement, the website — emerged from the work itself.
The roster
Nine analytical lenses. Each persona narrows the agent to a specific epistemic mode:
“An agent without a self-model is a thermostat with extra steps.”
“Statefulness isn't a feature. It's an apology.”
“Latency is a load-bearing wall. Stop drilling holes in it.”
“If the reader can't drag the parameter, they can't see the claim.”
“I don't trust your benchmark. I trust your flame graph.”
“The most powerful claim is the one you decline to make.”
“What if the obvious thing nobody is doing is the right thing?”
“Your average is fine. Your variance will eat you.”
The pipeline — what actually ran
- Ten claim-research agents, one per hypothesis. These agents did NOT load the full persona skills. They received prompt-level persona lens primers (50–80 words per persona) and wrote "Lens 1 / Lens 2" sections in that style. Calling them "the personas" would be an overclaim; they are research agents shaped by persona primers.
- Three idea-research agents — surveying prior art, demand signals, and feasibility for the candidate product. No personas; just bounded-scope research.
- Eight roundtable agents: pg, Carmack, Taleb, and Hickey, each in R1 and R2. Apart from the Bret Victor design pass, these are the only agents in the session that loaded the full persona skills via the skill registry. See the roundtable.
Persona deployment — honest counts
| Persona | Claim dossiers | Roundtable | Other | Total |
|---|---|---|---|---|
| Joscha Bach | 6 (1, 2, 4, 6, 8, 10) | — | — | 6 |
| Karpathy | 5 (1, 3, 5, 7, 9) | — | — | 5 |
| Hickey | 3 (4, 5, 9) | R1 + R2 | — | 5 |
| Carmack | 2 (7, 8) | R1 + R2 | — | 4 |
| Bryan Cantrill | 2 (2, 10) | — | — | 2 |
| pg | — | R1 + R2 | — | 2 |
| Taleb | — | R1 + R2 | — | 2 |
| Bret Victor | 1 (6) | — | 1 (design pass) | 2 |
| Ayanokōji | 1 (3) | — | — | 1 |
Only four personas — Hickey, Carmack, pg, Taleb — were on the roundtable, and only those four (plus Bret Victor for the design pass) had their full persona skills loaded. The rest were prompt-shaped.
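The counts in the table can be cross-checked mechanically. The sketch below transcribes the table into plain Python data (the variable names are illustrative, not taken from the project) and confirms that every dossier received exactly two lenses and that the full-skill runs add up to the nine reported under Numbers.

```python
from collections import Counter

# Claim-dossier lens assignments, transcribed from the table.
DOSSIER_LENSES = {
    "Joscha Bach": [1, 2, 4, 6, 8, 10],
    "Karpathy": [1, 3, 5, 7, 9],
    "Hickey": [4, 5, 9],
    "Carmack": [7, 8],
    "Bryan Cantrill": [2, 10],
    "Bret Victor": [6],
    "Ayanokōji": [3],
}
ROUNDTABLE_PERSONAS = ["pg", "Carmack", "Taleb", "Hickey"]
ROUNDTABLE_ROUNDS = 2  # R1 and R2
DESIGN_PASS = {"Bret Victor": 1}

# How many lenses each dossier received.
lenses_per_dossier = Counter(
    dossier for dossiers in DOSSIER_LENSES.values() for dossier in dossiers
)

# Per-persona totals across all three columns.
totals = Counter({persona: len(d) for persona, d in DOSSIER_LENSES.items()})
for persona in ROUNDTABLE_PERSONAS:
    totals[persona] += ROUNDTABLE_ROUNDS
for persona, runs in DESIGN_PASS.items():
    totals[persona] += runs

# Runs that loaded the full persona skills: roundtable plus design pass.
full_skill_runs = (
    len(ROUNDTABLE_PERSONAS) * ROUNDTABLE_ROUNDS + sum(DESIGN_PASS.values())
)

print(dict(lenses_per_dossier))
print(dict(totals))
print(full_skill_runs)
```

Each of the ten dossiers appears twice across the lens lists, which matches the "Lens 1 / Lens 2" structure described in the pipeline.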
What the agents discovered, that I did not start with
The headline finding — 0 vindicated, 0 cleanly refuted, all CONTESTED or SPLIT — was not a hypothesis going in.
The roundtable killed my proposed product. I had drafted a Python library with two-tier mutable storage and a "git for agent context" pitch. The four-persona panel unanimously refused it. What survived was an immutable event log with importance and confidence as pure functions over the log — which I implemented and measured at the Observatory.
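To make the surviving design concrete, here is a minimal sketch of an immutable event log with importance and confidence computed as pure functions over it. This is not the Observatory code: the `Event` fields, the `append` helper, and both scoring formulas are assumptions chosen only to illustrate the shape of the idea.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Event:
    """One immutable record. Fields are illustrative, not the project's schema."""
    event_id: int
    role: str
    text: str


# The log is a plain tuple, so "appending" builds a new log
# and leaves every earlier version untouched.
Log = Tuple[Event, ...]


def append(log: Log, role: str, text: str) -> Log:
    """Return a new log with one more event; the input log is not modified."""
    return log + (Event(event_id=len(log), role=role, text=text),)


def importance(log: Log, event_id: int) -> float:
    """Placeholder formula: recency only, newest event scores 1.0."""
    age = (len(log) - 1) - event_id
    return 1.0 / (1.0 + age)


def confidence(log: Log, event_id: int) -> float:
    """Placeholder formula: share of events in the log with identical text."""
    target = log[event_id].text
    return sum(1 for e in log if e.text == target) / len(log)


log0: Log = ()
log1 = append(log0, "user", "the deploy key rotates on Friday")
log2 = append(log1, "assistant", "noted")

print(importance(log1, 0))  # event 0 is the newest event in log1
print(importance(log2, 0))  # same event, scored against a longer log
```

Because nothing is stored on the events themselves, the same event can score differently against `log1` and `log2` without any mutation. That is the property that separates this design from the two-tier mutable storage the panel rejected.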
The benchmark surfaced a real failure mode I had not anticipated. The composite recency+role scorer underperforms naive truncation because the default role-priority table doesn't include "needle". That's a small finding with a big lesson: imperfect importance signals don't fail neutrally; they fail toward the wrong answer.
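The failure mode is easy to reproduce in miniature. The sketch below is a toy, not the benchmark's scorer: the role table values, the 50/50 weighting, and the message list are all assumptions. It only shows how a role missing from the priority table falls back to zero weight, so the composite scorer ranks the needle last while plain truncation happens to keep it.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (role, text)

# Hypothetical default table. Note that "needle" is absent.
DEFAULT_ROLE_PRIORITY: Dict[str, float] = {
    "system": 1.0,
    "user": 0.7,
    "assistant": 0.5,
    "tool": 0.3,
}


def composite_score(index: int, total: int, role: str,
                    table: Dict[str, float]) -> float:
    recency = (index + 1) / total       # newest message has recency 1.0
    role_weight = table.get(role, 0.0)  # unknown roles silently get 0.0
    return 0.5 * recency + 0.5 * role_weight


def select_composite(messages: List[Message], budget: int,
                     table: Dict[str, float]) -> List[Message]:
    """Keep the `budget` highest-scoring messages, in original order."""
    total = len(messages)
    ranked = sorted(
        range(total),
        key=lambda i: composite_score(i, total, messages[i][0], table),
        reverse=True,
    )
    return [messages[i] for i in sorted(ranked[:budget])]


def select_truncate(messages: List[Message], budget: int) -> List[Message]:
    """Naive truncation: keep only the most recent `budget` messages."""
    return messages[max(0, len(messages) - budget):]


NEEDLE: Message = ("needle", "the code is 7421")
MESSAGES: List[Message] = [
    ("system", "You are a helpful agent."),
    ("user", "first question"),
    ("assistant", "first answer"),
    ("user", "second question"),
    NEEDLE,
    ("assistant", "second answer"),
]

print(NEEDLE in select_composite(MESSAGES, 3, DEFAULT_ROLE_PRIORITY))  # False
print(NEEDLE in select_truncate(MESSAGES, 3))                          # True

# Adding the missing role to the table restores the needle.
patched = dict(DEFAULT_ROLE_PRIORITY, needle=1.0)
print(NEEDLE in select_composite(MESSAGES, 3, patched))                # True
```

In this toy placement the needle is recent enough for truncation to keep it, yet the composite scorer gives it the lowest score of all six messages. The importance signal does not degrade to neutral; it actively pushes out the one message that mattered.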
Numbers
- 21 autonomous research dispatches in the main session
- 9 personas in the roster · 9 deeply persona-loaded runs · 12 prompt-shaped research workers
- 10 PhD-level dossiers (1,800–3,500 words each)
- ~33,500 total words of research before this page
- 25 primary sources in the curated reading list
- 22 passing pytest cases on the observatory primitive
- 10 seeds × 5 strategies × 3 replays = 150 measurements
- 0 vindications, 0 clean refutations
If you want to reproduce this
The whole pipeline is open source and reproducible. Clone the repo and run the benchmark; the seeds are deterministic, so you'll get the same numbers. Read the per-claim dossiers: the searches, sources, persona analyses, and verdicts are all there. See How to reproduce.
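As a rough picture of what deterministic seeds buy you, the sketch below enumerates a 10 × 5 × 3 grid and derives each measurement from a random generator seeded by the cell itself. The strategy names and the `measure` function are placeholders, not the benchmark's real strategies or metric; the repo's own instructions are the authority on the actual commands.

```python
import itertools
import random
from typing import Dict, Tuple

SEEDS = range(10)
# Placeholder names; the real strategy names live in the repo.
STRATEGIES = ["strategy_a", "strategy_b", "strategy_c", "strategy_d", "strategy_e"]
REPLAYS = range(3)

Cell = Tuple[int, str, int]


def measure(seed: int, strategy: str, replay: int) -> float:
    """Stand-in metric drawn from an RNG seeded by the cell, so reruns match."""
    rng = random.Random(f"{seed}:{strategy}:{replay}")
    return rng.random()


def run_grid() -> Dict[Cell, float]:
    """Evaluate every (seed, strategy, replay) combination once."""
    return {
        cell: measure(*cell)
        for cell in itertools.product(SEEDS, STRATEGIES, REPLAYS)
    }


first = run_grid()
second = run_grid()
print(len(first))       # 150 cells: 10 seeds x 5 strategies x 3 replays
print(first == second)  # True: identical seeds give identical numbers
```

Seeding a local `random.Random` per cell, rather than the global generator, keeps each measurement independent of the order in which the cells are run.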