Memory & Context
LLMs are place-oriented memory systems with shallow effective capacity and no native consolidation/forgetting machinery.
Strong forms of the claims systematically failed; weak forms systematically held. The interesting object is not the table of results but the shape of the disagreement. The page below is the agents' working notebook.
Each cell counts claims (rows) by verdict (columns). The empty vindicated column is the headline.
Nine analytical lenses. Solid edges connect personas paired on a claim, and are thicker when that dossier cites more papers. Dashed edges connect the four roundtable personas.
x: research thread · y: verdict · size: papers cited · color: verdict.
The ten claims sort into four convergent threads. Each thread carries an engineering recommendation that survived adversarial review.
LLMs are place-oriented memory systems with shallow effective capacity and no native consolidation/forgetting machinery.
Brain ↔ transformer mappings hold at the algorithmic level on a restricted subspace; strong 'homology' claims fail.
LLM self-models are real but shallow; the legible CoT trace cannot be trusted as a window into them.
What looks like emergent social cognition is mostly pretraining-derived capability being elicited, not generated.
The research's strongest engineering surface is agent context as an immutable event log, with importance and confidence as pure functions over the log. It is packaged as a Python primitive you can install and run. Because scorers only read the log, different scorers can be A/B-tested against the same historical events, and failure modes become deterministically replayable.
# install
pip install claim-observatory
# usage
# compare is used below; it is assumed here to be exported by the same package
from observatory import EventLog, view, importance, compare
log = EventLog()
log.append({"role": "user", "content": "..."})
log.append({"role": "assistant", "content": "..."})
log.append({"role": "tool", "name": "search", "content": "..."})
# tier membership is a derived view, not a stored place
working = view(
    log,
    scorer=importance.recency_attention(),
    window=4096,
)
# A/B test scorers against the same log — deterministic replay
alt = view(log, scorer=importance.task_relevance(query="..."), window=4096)
delta = compare(working, alt)
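To make the idea concrete without depending on the package's internals, here is a minimal standard-library sketch of the same pattern: an append-only log, scorers as pure functions over it, and views derived on demand. It is not the claim-observatory implementation. Every name in it (MiniEventLog, mini_view, mini_compare, recency, keyword_relevance) is hypothetical, and the window is measured in characters rather than tokens purely to keep the example self-contained.

```python
from types import MappingProxyType

# Hypothetical sketch: none of these names come from claim-observatory.


class MiniEventLog:
    """Append-only log: events can be added but never edited or removed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        # Store a read-only copy so later changes to the caller's dict
        # cannot rewrite history.
        self._events.append(MappingProxyType(dict(event)))

    def events(self):
        return tuple(self._events)


def recency(index, event, n):
    """Pure scorer: newer events score higher; the newest scores 1.0."""
    return (index + 1) / n


def keyword_relevance(query):
    """Build a pure scorer that counts query words present in the content."""
    words = set(query.lower().split())

    def score(index, event, n):
        return float(len(words & set(event.get("content", "").lower().split())))

    return score


def mini_view(log, scorer, window):
    """Derive a view: greedily keep the highest-scoring events whose total
    content length fits in `window` characters (an assumption; a real
    system would likely count tokens). Returns event indices in log order."""
    events = log.events()
    n = len(events)
    # Highest score first; ties go to the newer event.
    ranked = sorted(range(n), key=lambda i: (-scorer(i, events[i], n), -i))
    chosen, used = [], 0
    for i in ranked:
        cost = len(events[i].get("content", ""))
        if used + cost <= window:
            chosen.append(i)
            used += cost
    return tuple(sorted(chosen))


def mini_compare(a, b):
    """Report which event indices two views share and where they differ."""
    sa, sb = set(a), set(b)
    return {"only_a": sorted(sa - sb), "only_b": sorted(sb - sa), "both": sorted(sa & sb)}


log = MiniEventLog()
log.append({"role": "user", "content": "find papers on memory consolidation"})
log.append({"role": "tool", "content": "three papers on consolidation found"})
log.append({"role": "assistant", "content": "here is a summary"})

# Two scorers replayed against the same immutable history.
by_recency = mini_view(log, recency, window=60)
by_query = mini_view(log, keyword_relevance("memory consolidation"), window=60)
delta = mini_compare(by_recency, by_query)
print(delta)
```

Because neither scorer writes to the log, rerunning either view over the same events yields the same selection, which is what makes a disagreement between two scorers replayable rather than anecdotal.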