Research & Writing

Technical deep dives on the architecture, the benchmarks, and the methodology behind agents that genuinely improve over time.

We Tested Our Agent Against Itself. The One With History Won by 10×.

Same model. Same questions. 97.8% accuracy vs 77.8%. 10× fewer fabrications. The only variable was the Wisdom Layer architecture — persistent memory, self-authored rules, and an internal critic that catches narrative inflation before it reaches the user.

Read on Medium →

The Wisdom Layer: The Missing Architecture Between LLMs and Intelligence

Why agents that process 10,000 conversations a day learn nothing from any of them. A walk through the gap between retrieval-augmented memory and the closed cognitive loop — capture, reflect, evolve, critic — that lets an agent be different next month than it was today.

Read on Substack →
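The capture, reflect, evolve, critic loop that the article contrasts with plain retrieval can be sketched as a toy Python class. Every name, rule, and heuristic below is invented for illustration; this is not the Wisdom Layer's actual API, just the shape of a closed loop where reflection proposes rules and a critic gates them:

```python
from dataclasses import dataclass, field

@dataclass
class ToyAgent:
    """Illustrative capture -> reflect -> evolve -> critic loop."""
    episodes: list = field(default_factory=list)    # capture: raw interaction log
    directives: list = field(default_factory=list)  # evolve: self-authored rules

    def capture(self, question: str, answer: str) -> None:
        # Capture: record every interaction, success or failure.
        self.episodes.append({"q": question, "a": answer})

    def reflect(self):
        # Reflect: distill a repeated failure pattern into a candidate rule.
        failures = [e for e in self.episodes if e["a"] == "unknown"]
        if len(failures) >= 2:
            return "When unsure, say so and name what evidence is missing."
        return None

    def critic(self, rule) -> bool:
        # Critic: only admit rules that push toward honesty, not inflation.
        return rule is not None and "say so" in rule

    def evolve(self) -> None:
        # Evolve: a candidate rule becomes a directive only if the critic passes it.
        rule = self.reflect()
        if self.critic(rule) and rule not in self.directives:
            self.directives.append(rule)

agent = ToyAgent()
agent.capture("Q1", "unknown")
agent.capture("Q2", "unknown")
agent.evolve()
print(agent.directives)  # the distilled rule survived the critic
```

The point of the sketch is the wiring, not the heuristics: retrieval alone would only ever replay `episodes`, while the loop changes `directives`, which is what makes next month's agent different from today's.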

Synthetic Epistemology: How a $0.80/M-Token Model Engineered a Protocol to Falsify Itself

A small model autonomously designed a framework to catch its own confabulation. How a self-critical loop, anchored in append-only provenance, lets a cheap model audit its own outputs without a heavyweight verifier in the path.

Read on Substack →
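"Append-only provenance" admits a small self-contained sketch: a hash-chained log where each entry commits to its predecessor, so a cheap self-audit pass can detect any claim rewritten after the fact. The class and field names are hypothetical, assumed for illustration rather than taken from the article's protocol:

```python
import hashlib
import json

class ProvenanceLog:
    """Illustrative append-only, hash-chained record of claims and sources."""

    def __init__(self):
        self.entries = []

    def append(self, claim: str, source: str) -> None:
        # Each entry's hash commits to the claim, its source, and the
        # previous entry's hash, so history cannot be silently edited.
        prev = self.entries[-1]["hash"] if self.entries else "genesis"
        payload = json.dumps(
            {"claim": claim, "source": source, "prev": prev}, sort_keys=True
        )
        self.entries.append({
            "claim": claim,
            "source": source,
            "prev": prev,
            "hash": hashlib.sha256(payload.encode()).hexdigest(),
        })

    def audit(self) -> bool:
        # Self-audit: re-derive every hash and check the chain links.
        # Any tampered or inserted entry breaks verification.
        prev = "genesis"
        for e in self.entries:
            payload = json.dumps(
                {"claim": e["claim"], "source": e["source"], "prev": prev},
                sort_keys=True,
            )
            if e["prev"] != prev:
                return False
            if hashlib.sha256(payload.encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True

log = ProvenanceLog()
log.append("Latency dropped 40%", "benchmark run #12")
assert log.audit()                                # intact chain verifies
log.entries[0]["claim"] = "Latency dropped 90%"   # a confabulated edit
assert not log.audit()                            # the audit catches it
```

The audit is just hashing, which is why no heavyweight verifier is needed in the path: the expensive part (grounding a claim in a source) happens once at append time, and every later check is cheap.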

From Memory to Judgment: Engineering Agents That Actually Learn

Scaling memory is the easy part. Retrieval isn’t judgment — and judgment is what breaks in production. The case for directive evolution as the real upgrade path for any agent that needs to behave differently next month than it does today.

Read on Substack →

← Back to home