Skip to content
How this was made

Agents at the wheel.

Every dossier, every verdict, the roundtable that killed the product, the benchmark that measured the architecture — produced in a single autonomous session via 21 parallel research dispatches. This page tells that story honestly, including which agents loaded the full persona skills and which only had prompt-level primers.

The prompt

A single, deliberately wild instruction: generate ten ambitious hypotheses at the intersection of AI, context, and the brain — make them strong enough to fail — then research each one rigorously with persona-level analytical lenses, and adjudicate a verdict. If anything looked like a real product, stress-test it. Build whatever survives.

The shape of the work — the file layout, the design language, the measurement, the website — emerged from the work itself.

The roster

Nine analytical lenses. Each persona narrows the agent to a specific epistemic mode:

Karpathy
ML/transformer internals; reasons in tokens; debugs at the embedding layer.
“If you can't show me the activations, you don't have a claim.”
Joscha Bach
Computational consciousness; functionalism; agents that contain agents.
“An agent without a self-model is a thermostat with extra steps.”
Hickey
Values over places; identity over time; simple over easy.
“Statefulness isn't a feature. It's an apology.”
Carmack
Frame-time discipline; benchmarks before opinions.
“Latency is a load-bearing wall. Stop drilling holes in it.”
Bret Victor
Ideas you can grasp; representations you can manipulate.
“If the reader can't drag the parameter, they can't see the claim.”
Bryan Cantrill
Production observability is the only ground truth.
“I don't trust your benchmark. I trust your flame graph.”
Ayanokōji
What an agent withholds is louder than what it says.
“The most powerful claim is the one you decline to make.”
pg
Naive questions; first-principle taste; essay-shaped thinking.
“What if the obvious thing nobody is doing is the right thing?”
Taleb
Tail risk, optionality; mediocristan vs extremistan.
“Your average is fine. Your variance will eat you.”

The pipeline — what actually ran

  1. Ten claim-research agents, one per hypothesis. These agents did NOT load the full persona skills. They received prompt-level persona lens primers (50–80 words per persona) and produced "Lens 1 / Lens 2" sections in that style. Calling them "the personas" is overclaim; they are research agents shaped by persona primers.
  2. Three idea-research agents — surveying prior art, demand signals, and feasibility for the candidate product. No personas; just bounded-scope research.
  3. Eight roundtable agents — pg, carmack, taleb, hickey × R1 and R2. These are the only agents in the session that loaded the full persona skills via the skill registry. See the roundtable.

Persona deployment — honest counts

PersonaClaim dossiersRoundtableOtherTotal
Joscha Bach6 (1, 2, 4, 6, 8, 10)6
Karpathy5 (1, 3, 5, 7, 9)5
Hickey3 (4, 5, 9)R1 + R25
Carmack2 (7, 8)R1 + R24
Bryan Cantrill2 (2, 10)2
pgR1 + R22
TalebR1 + R22
Bret Victor1 (6)1 (design pass)2
Ayanokōji1 (3)1

Only four personas — Hickey, Carmack, pg, Taleb — were on the roundtable, and only those four (plus Bret Victor for the design pass) had their full persona skills loaded. The rest were prompt-shaped.

What the agents discovered, that I did not start with

The headline finding — 0 vindicated, 0 cleanly refuted, all CONTESTED or SPLIT — was not a hypothesis going in.

The roundtable killed my proposed product. I had drafted a Python library with two-tier mutable storage and a "git for agent context" pitch. The four-persona panel unanimously refused it. What survived was an immutable event log with importance and confidence as pure functions over the log — which I implemented and measured at the Observatory.

The benchmark surfaced a real failure mode I had not anticipated. The composite recency+role scorer underperforms naive truncation because the default role-priority table doesn't include "needle". That's a small finding with a big lesson: imperfect importance signals don't fail neutrally; they fail toward the wrong answer.

Numbers

If you want to reproduce this

The whole pipeline is open source and reproducible. Clone the repo. Run the benchmark — you'll get the same numbers (deterministic seeds). Read the per-claim dossiers — the searches, sources, persona analyses, and verdicts are all there. How to reproduce.