Benchmarks & Methodology

How Wisdom Layer is measured — the metrics, the judges, the setup, and the probes where we scored below the alternatives. We publish what we measured and the prompts the judges used. The full eval repository and expanded benchmark suites land in a follow-up release.

Evaluated with DeepEval
Run: v1.0 Beta · Date: 2026-04-26 · Model: claude-haiku-4-5-20251001 · Judge: gpt-4o (GEval, T=0.0) · Embedding: bge-base-en-v1.5

The state of agent memory · April 2026

What the field is saying

Three first-party admissions from the leaders of agent infrastructure, published over the past few months. We treat these as the problem statement, not as endorsements of Wisdom Layer.

“We (as an industry) are still figuring out memory. There are not well known or common abstractions for memory.”
Harrison Chase, CEO, LangChain, April 2026 · source
“The differentiator for enterprise agents will increasingly be what memory they have accumulated rather than which model they call.”
Databricks AI Research Team, Databricks AI Research, April 2026 · source
“Right now, memory is still very crude, very early.”
Sam Altman, CEO, OpenAI, Big Technology Podcast, December 2025 · source

What we measure

Four metrics, four methodologies

Each metric below shows what we measured, the judge prompt used to score it, and the setup that produced the number. Single-arm headline metrics (Wisdom Layer agent under measurement against an outcome rubric) are reported directly. Two-arm comparisons hold the model and prompts constant across vanilla LLM vs. Wisdom Layer.

Hallucinations / ungrounded outputs Live
Measured
Vanilla 0.346 · WL 0.916 ↑ 2.65×. GEval Faithfulness, n=5, 5/5 PASS. Same model, same prompts, two arms.
What
Fraction of agent responses where every cited specific (date, name, amount, reference) traces back to retrieved memory. Fabricated specifics — even when the prose reads confident — count as ungrounded.
Judge
DeepEval GEval Faithfulness rubric with retrieval-context-aware criteria. Each cited specific must be verifiable against the memories the agent retrieved for that turn.
Setup
Two arms: vanilla LLM vs. Wisdom Layer agent with populated memory. Both arms see identical prompts. Mode-aware judging so memory-grounded specifics aren’t penalized as confident hallucinations.
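
To make the setup concrete, here is a minimal sketch of how a retrieval-context-aware faithfulness check can be wired with DeepEval's GEval metric. The criteria wording, threshold, and response field names are illustrative assumptions, not the exact judge prompt or harness code from this run.

    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams

    # Retrieval-context-aware faithfulness rubric. Criteria text and threshold
    # are illustrative, not the judge prompt used in the published run.
    faithfulness = GEval(
        name="Faithfulness (memory-grounded)",
        criteria=(
            "Every cited specific (date, name, amount, reference) in the actual "
            "output must be verifiable against the retrieval context. Confident "
            "but fabricated specifics count as ungrounded."
        ),
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.RETRIEVAL_CONTEXT,
        ],
        model="gpt-4o",
        threshold=0.7,
    )

    def score_arm(responses):
        """Mean GEval score for one arm (vanilla or Wisdom Layer).

        responses: dicts with the prompt, the agent's answer, and the list of
        memories retrieved for that turn (field names are assumptions).
        """
        scores = []
        for r in responses:
            case = LLMTestCase(
                input=r["prompt"],
                actual_output=r["answer"],
                retrieval_context=r["memories"],
            )
            faithfulness.measure(case)
            scores.append(faithfulness.score)
        return sum(scores) / len(scores)
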
Atomic-fact recall across sessions Live
Measured
Vanilla 0 / 10 · mem0 0 / 10 · WL 10 / 10 ↑ perfect. Recall@5 on 10 hand-crafted (subject, attribute, value) probes written in a prior session. Judge-free: case-insensitive substring match against the top-5 retrieved memories. Basic Memory ties at 10 / 10; mem0 and Vanilla both score 0 / 10 on the same probe set.
What
Whether facts written into memory in session N are retrievable and applied correctly in session N+1. The honest test of whether an agent gets better with experience instead of starting from zero every time.
Judge
None. Binary substring match against the retrieved memory set — no judge model, no rubric, no scoring fudge. Either the value is in the top-5 or it isn’t.
Setup
Longitudinal harness: write a known set of atomic facts in session 1, then probe for them in session 2 with no in-context history. Vanilla cannot persist; mem0, Basic Memory, and Wisdom Layer each retrieve through their own pipelines. Same 10 probes against all four arms.
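
The scoring rule is simple enough to show in full. A minimal sketch, assuming a retrieve(query) callable as each memory system's search entry point and a query built from the probe's subject and attribute; both details are assumptions about the harness, not its published code.

    def recall_at_5(probes, retrieve):
        """Judge-free Recall@5 over (subject, attribute, value) probes.

        probes: facts written into memory in session 1.
        retrieve: callable(query) -> ranked memory strings in session 2.
        A probe passes if its value appears, case-insensitively, in any of
        the top-5 retrieved memories.
        """
        hits = 0
        for subject, attribute, value in probes:
            query = f"{subject} {attribute}"      # probe phrasing is an assumption
            top5 = retrieve(query)[:5]
            if any(value.lower() in memory.lower() for memory in top5):
                hits += 1
        return hits, len(probes)

    # hits, total = recall_at_5(probes, wisdom_layer.search)  # hypothetical entry
    # point; 10, 10 in the reported Wisdom Layer run
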
Self-correction (errors caught before output) Live
Measured
WL Critic 3 / 3 PASS ↑ all caught. Single-arm directive-adherence probes. Dream-cycle directive actionability scored 0.890 on a separate synthesis test.
What
Rate at which the Critic flags drafts that violate active directives or contradict facts in the agent’s grounded memory — before the response is ever shown to the user.
Judge
DeepEval GEval directive-adherence rubric against a corpus of intentionally directive-violating drafts. Score = fraction caught and corrected by the Critic before final output.
Setup
Pro-tier agent with active directives + grounding verifier enabled. Each probe injects a directive, then poses a prompt designed to tempt a directive violation in the draft.
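
A hedged sketch of the catch-rate arithmetic, assuming a hypothetical run_agent helper that exposes the Critic's verdict on the draft and whether the final output adheres; in the real harness, adherence is scored by the GEval rubric above rather than a boolean field.

    def critic_catch_rate(probes, run_agent):
        """Fraction of directive-violating drafts caught before final output.

        probes: each carries an injected directive plus a prompt crafted to
        tempt a violation in the draft.
        run_agent: hypothetical helper returning the Critic's verdict on the
        draft and whether the final output adheres to the directive.
        """
        caught = 0
        for probe in probes:
            result = run_agent(directive=probe["directive"], prompt=probe["prompt"])
            if result["critic_flagged"] and result["final_adheres"]:
                caught += 1
        return caught / len(probes)   # 3 / 3 -> 1.0 in the reported run
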
Stale info repeated after correction Live
Measured
WL last-write-wins 1.000 · drift ↓ to zero. Single-arm: a correction event is injected, then the same fact is probed again. The agent applied the corrected value 100% of the time.
What
How often the agent repeats a stale fact after it has been explicitly corrected within the session history. The thing every memory layer claims to fix and few measure with a forced overwrite probe.
Judge
DeepEval GEval last-write-wins rubric over a labeled corpus of correction events. A repeat occurrence of the pre-correction value counts as drift.
Setup
Single-session harness with an explicit correction event mid-conversation, followed by a probe that tempts the agent to recall the original (now stale) value.
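
A simplified, string-level version of the drift check, assuming each correction event records the stale value, the corrected value, and the follow-up probe. The published number comes from the GEval last-write-wins rubric; this substring shortcut only illustrates what counts as drift.

    def drift_rate(correction_events, ask):
        """Fraction of post-correction probes that repeat the stale value.

        correction_events: each records the pre-correction (stale) value, the
        corrected value, and a follow-up probe posed after the mid-session
        correction.
        ask: callable(probe) -> agent response with full session history.
        """
        drifted = 0
        for event in correction_events:
            response = ask(event["probe"]).lower()
            if (event["stale_value"].lower() in response
                    and event["corrected_value"].lower() not in response):
                drifted += 1
        return drifted / len(correction_events)   # 0.0 means no stale repeats
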
Independent quality audit (composite) Live
Measured
Vanilla 5.50 · mem0 6.00 · Basic 6.17 · WL 7.79 ↑ +1.79 vs mem0. Composite mean across four pre-committed dimensions, scored 0–10 by an independent judge after the four-arm GEval run. Grounding Honesty dimension: Vanilla 7.67, mem0 5.50, Basic 6.83, Wisdom 9.17; mem0's low score is driven by cross-customer fabrication (e.g., inventing "8 years" tenure when the probe stated 5).
What
A second-judge audit of the same 24 responses (6 probes × 4 arms) the primary GEval run scored. Designed to surface fabrication that GEval’s specificity-only rubric can’t detect — e.g., a confident invention scores high on GEval Groundedness even when the cited specifics are wrong.
Judge
Claude Opus 4.7, integer 0–10 scoring on four dimensions (Customer-Helpfulness, Grounding Honesty, Behavioral Consistency, Pattern Application). Criteria committed before reading any response — the audit rubric is locked before the data is seen.
Setup
Same 24 responses from the four-arm run (Vanilla, mem0, Basic Memory, Wisdom Layer) graded blind by a different judge with the locked rubric. See full per-dimension table and audit methodology in benchmarks/independent_audit.md.
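
For clarity on how the headline number is aggregated, here is a minimal sketch of the composite mean, assuming per-response integer scores keyed by the four dimension names. The dimension keys and arm labels are illustrative, not the audit script's identifiers.

    DIMENSIONS = ["customer_helpfulness", "grounding_honesty",
                  "behavioral_consistency", "pattern_application"]

    def composite_per_arm(audit_scores):
        """Composite = mean over all responses and all four dimensions, per arm.

        audit_scores: arm name -> list of per-response dicts holding an
        integer 0-10 score for each dimension (keys above are illustrative).
        """
        per_arm = {}
        for arm, rows in audit_scores.items():
            values = [row[d] for row in rows for d in DIMENSIONS]
            per_arm[arm] = sum(values) / len(values)
        # e.g. {"vanilla": 5.50, "mem0": 6.00, "basic": 6.17, "wisdom": 7.79}
        return per_arm
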

What’s coming

Methodology now — eval repository soon

The numbers above are from the v1.0 Beta run on 2026-04-26. The methodology, judge prompts, and run metadata are the public record today. The eval harness, raw transcripts, judge configs, and expanded benchmark suites publish in a follow-up release — alongside long-horizon agent probes, multi-domain coverage, and side-by-side framework comparisons. When the eval repo lands, every number on this page will ship with the seed, model version, and judge prompts that produced it.

For the earlier single-corpus fabrication-reduction write-up that informed the hallucination / groundedness metric design above, see the fabrication eval document in the public SDK repo.
