How this was made

Agents at the wheel.

Every dossier, every verdict, the roundtable that killed the product, the benchmark that measured the architecture — produced in a single autonomous session via 21 parallel research dispatches. This page tells that story honestly, including which agents loaded the full persona skills and which only had prompt-level primers.

The prompt

A single, deliberately wild instruction: generate ten ambitious hypotheses at the intersection of AI, context, and the brain — make them strong enough to fail — then research each one rigorously with persona-level analytical lenses, and adjudicate a verdict. If anything looked like a real product, stress-test it. Build whatever survives.

The shape of the work — the file layout, the design language, the measurement, the website — emerged from the work itself.

The roster

Nine analytical lenses. Each persona narrows the agent to a specific epistemic mode:

Karpathy

ML/transformer internals; reasons in tokens; debugs at the embedding layer.

“If you can't show me the activations, you don't have a claim.”

Joscha Bach

Computational consciousness; functionalism; agents that contain agents.

“An agent without a self-model is a thermostat with extra steps.”

Hickey

Values over places; identity over time; simple over easy.

“Statefulness isn't a feature. It's an apology.”

Carmack

Frame-time discipline; benchmarks before opinions.

“Latency is a load-bearing wall. Stop drilling holes in it.”

Bret Victor

Ideas you can grasp; representations you can manipulate.

“If the reader can't drag the parameter, they can't see the claim.”

Bryan Cantrill

Production observability is the only ground truth.

“I don't trust your benchmark. I trust your flame graph.”

Ayanokōji

What an agent withholds is louder than what it says.

“The most powerful claim is the one you decline to make.”

pg

Naive questions; first-principles taste; essay-shaped thinking.

“What if the obvious thing nobody is doing is the right thing?”

Taleb

Tail risk, optionality; mediocristan vs extremistan.

“Your average is fine. Your variance will eat you.”

The pipeline — what actually ran

Ten claim-research agents, one per hypothesis. These agents did NOT load the full persona skills. They received prompt-level persona lens primers (50–80 words per persona) and produced "Lens 1 / Lens 2" sections in that style. Calling them "the personas" would be an overclaim; they are research agents shaped by persona primers.
Three idea-research agents, surveying prior art, demand signals, and feasibility for the candidate product. No personas; just bounded-scope research.
Eight roundtable agents: pg, Carmack, Taleb, and Hickey, each in R1 and R2. Along with the Bret Victor design pass, these are the only agents in the session that loaded the full persona skills via the skill registry. See the roundtable.

Persona deployment — honest counts

Persona	Claim dossiers	Roundtable	Other	Total
Joscha Bach	6 (1, 2, 4, 6, 8, 10)	—	—	6
Karpathy	5 (1, 3, 5, 7, 9)	—	—	5
Hickey	3 (4, 5, 9)	R1 + R2	—	5
Carmack	2 (7, 8)	R1 + R2	—	4
Bryan Cantrill	2 (2, 10)	—	—	2
pg	—	R1 + R2	—	2
Taleb	—	R1 + R2	—	2
Bret Victor	1 (6)	—	1 (design pass)	2
Ayanokōji	1 (3)	—	—	1

Only four personas — Hickey, Carmack, pg, Taleb — were on the roundtable, and only those four (plus Bret Victor for the design pass) had their full persona skills loaded. The rest were prompt-shaped.
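
The counts in the table can be checked mechanically. The sketch below copies the rows into a small Python structure and recomputes the totals; the variable names are illustrative, and the data is only what the table states.

```python
from collections import Counter

# persona -> (claim dossiers, roundtable rounds, other runs), copied from the table
DEPLOYMENT = {
    "Joscha Bach": ([1, 2, 4, 6, 8, 10], 0, 0),
    "Karpathy": ([1, 3, 5, 7, 9], 0, 0),
    "Hickey": ([4, 5, 9], 2, 0),
    "Carmack": ([7, 8], 2, 0),
    "Bryan Cantrill": ([2, 10], 0, 0),
    "pg": ([], 2, 0),
    "Taleb": ([], 2, 0),
    "Bret Victor": ([6], 0, 1),
    "Ayanokōji": ([3], 0, 0),
}

# Total runs per persona: dossier lenses + roundtable rounds + other runs.
totals = {
    name: len(claims) + rounds + other
    for name, (claims, rounds, other) in DEPLOYMENT.items()
}

# How many persona lenses each claim dossier received.
lenses_per_claim = Counter(
    claim for claims, _, _ in DEPLOYMENT.values() for claim in claims
)

# Roundtable runs across all personas (four personas, two rounds each).
roundtable_runs = sum(rounds for _, rounds, _ in DEPLOYMENT.values())

print(totals)
print(dict(lenses_per_claim))
print(roundtable_runs)
```

Run as written, this reproduces the Total column, shows that each of the ten dossiers received exactly two lenses, and yields the eight roundtable runs listed in the pipeline.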

What the agents discovered, that I did not start with

The headline finding — 0 vindicated, 0 cleanly refuted, all CONTESTED or SPLIT — was not a hypothesis going in.

The roundtable killed my proposed product. I had drafted a Python library with two-tier mutable storage and a "git for agent context" pitch. The four-persona panel unanimously refused it. What survived was an immutable event log with importance and confidence as pure functions over the log — which I implemented and measured at the Observatory.
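
As a rough illustration of the design that survived, here is a minimal sketch of an append-only event log with importance and confidence computed as pure functions over it. The names (Event, append, importance, confidence), the recency half-life, and the substring-based confidence rule are all assumptions made for illustration; they are not the Observatory's actual API or scoring rules.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: an event cannot be modified after creation
class Event:
    seq: int   # position in the log
    role: str  # e.g. "user", "assistant"
    text: str


# The log is an immutable tuple; nothing is ever updated in place.
Log = Tuple[Event, ...]


def append(log: Log, role: str, text: str) -> Log:
    """Return a NEW log with one more event; the input log is untouched."""
    return log + (Event(seq=len(log), role=role, text=text),)


def importance(log: Log, seq: int, half_life: float = 4.0) -> float:
    """Hypothetical rule: exponential recency decay, recomputed from the log."""
    age = (len(log) - 1) - seq
    return 0.5 ** (age / half_life)


def confidence(log: Log, claim: str) -> float:
    """Hypothetical rule: fraction of events whose text contains the claim."""
    if not log:
        return 0.0
    return sum(claim in event.text for event in log) / len(log)


log: Log = ()
log = append(log, "user", "the sky is blue")
log = append(log, "assistant", "agreed, the sky is blue")
log = append(log, "user", "what about the sea?")

print(importance(log, seq=2))          # newest event
print(importance(log, seq=0))          # oldest event, decayed
print(confidence(log, "sky is blue"))  # supported by two of three events
```

Because neither score is stored, there is no second tier of mutable state to keep in sync: appending an event changes every derived value simply by changing the input to the functions.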

The benchmark surfaced a real failure mode I had not anticipated. The composite recency+role scorer underperforms naive truncation because the default role-priority table doesn't include "needle". That's a small finding with a big lesson: imperfect importance signals don't fail neutrally; they fail toward the wrong answer.
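
The failure mode is easy to reproduce in miniature. The sketch below is a toy, not the benchmark's scorer: the 50/50 weighting, the priority values, the seven-event log, and the function names are all assumptions chosen to show the mechanism. A role missing from the priority table silently scores zero, so older events with known roles outrank it.

```python
from typing import List, Mapping, Tuple

Event = Tuple[int, str]  # (position in the log, role); positions run 0..n-1

# Hypothetical default table. Note that "needle" is absent.
DEFAULT_ROLE_PRIORITY: Mapping[str, float] = {
    "system": 1.0,
    "user": 0.8,
    "assistant": 0.5,
}


def composite_keep(
    events: List[Event],
    budget: int,
    priority: Mapping[str, float] = DEFAULT_ROLE_PRIORITY,
) -> List[Event]:
    """Keep the `budget` highest-scoring events by recency + role priority."""
    n = len(events)

    def score(event: Event) -> float:
        pos, role = event
        recency = pos / (n - 1) if n > 1 else 1.0
        # Unknown roles fall back to 0.0: this is the silent failure.
        return 0.5 * recency + 0.5 * priority.get(role, 0.0)

    kept = sorted(events, key=score, reverse=True)[:budget]
    return sorted(kept)  # restore log order


def truncate_keep(events: List[Event], budget: int) -> List[Event]:
    """Naive truncation: keep only the most recent `budget` events."""
    return events[max(0, len(events) - budget):]


roles = ["system", "user", "assistant", "user", "needle", "assistant", "user"]
events = list(enumerate(roles))
needle = (4, "needle")

print(needle in truncate_keep(events, 3))    # True: needle is recent enough
print(needle in composite_keep(events, 3))   # False: zero priority sinks it

patched = {**DEFAULT_ROLE_PRIORITY, "needle": 1.0}
print(needle in composite_keep(events, 3, patched))  # True once the role is known
```

In this toy the composite scorer does not degrade to truncation when its table is incomplete; it actively prefers an older user turn over the needle, which is the sense in which an imperfect signal fails toward the wrong answer.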

Numbers

21 autonomous research dispatches in the main session
9 personas in the roster · 9 deeply persona-loaded runs · 12 prompt-shaped research workers
10 PhD-level dossiers (1,800–3,500 words each)
~33,500 total words of research before this page
25 primary sources in the curated reading list
22 passing pytest cases on the observatory primitive
10 seeds × 5 strategies × 3 replays = 150 measurements
0 vindications, 0 clean refutations

If you want to reproduce this

The whole pipeline is open source and reproducible. Clone the repo and run the benchmark; the seeds are deterministic, so you will get the same numbers. The per-claim dossiers record the searches, sources, persona analyses, and verdicts. See How to reproduce.
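
For readers wondering what "deterministic seeds" buys, here is a minimal sketch of a 10 × 5 × 3 measurement grid in which every cell derives its own random generator from its coordinates. The strategy names, the measure function, and the seeding scheme are placeholders invented for illustration; the real benchmark's strategies and metric live in the repo.

```python
import itertools
import random

SEEDS = range(10)
# Placeholder names: stand-ins for the benchmark's five strategies.
STRATEGIES = [f"strategy_{i}" for i in range(5)]
REPLAYS = range(3)


def measure(seed: int, strategy: str, replay: int) -> float:
    """Stand-in measurement: a pseudo-random score fixed by the cell's coordinates."""
    # Seeding from a string is deterministic across runs and processes.
    rng = random.Random(f"{seed}:{strategy}:{replay}")
    return rng.random()


def run_grid() -> dict:
    """Evaluate every (seed, strategy, replay) cell exactly once."""
    cells = itertools.product(SEEDS, STRATEGIES, REPLAYS)
    return {cell: measure(*cell) for cell in cells}


first = run_grid()
second = run_grid()

print(len(first))       # number of measurements in the grid
print(first == second)  # identical results on a rerun
```

Because no cell depends on global random state or on the order in which cells run, a rerun on another machine produces the same table.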