Benchmarks & Methodology

How Wisdom Layer is measured — the metrics, the judges, the setup, and the probes where we scored below the alternatives. We publish what we measured and the prompts the judges used. The full eval repository and expanded benchmark suites land in a follow-up release.

Evaluated with DeepEval
Run: v1.0 Beta · Date: 2026-04-26 · Model: claude-haiku-4-5-20251001 · Judge: gpt-4o (GEval, T=0.0) · Embedding: bge-base-en-v1.5

The state of agent memory · April 2026

What the field is saying

Three first-party admissions from the leaders of agent infrastructure, published over the past few months. We treat these as the problem statement, not as endorsements of Wisdom Layer.

“We (as an industry) are still figuring out memory. There are not well known or common abstractions for memory.”
Harrison Chase, CEO, LangChain, April 2026 · source
“The differentiator for enterprise agents will increasingly be what memory they have accumulated rather than which model they call.”
Databricks AI Research Team, Databricks AI Research, April 2026 · source
“Right now, memory is still very crude, very early.”
Sam Altman, CEO, OpenAI, Big Technology Podcast, December 2025 · source

What we measure

Four metrics, four methodologies

Each metric below shows what we measured, the judge prompt used to score it, and the setup that produced the number. Single-arm headline metrics (Wisdom Layer agent under measurement against an outcome rubric) are reported directly. Two-arm comparisons hold the model and prompts constant across vanilla LLM vs. Wisdom Layer.

Hallucinations / ungrounded outputs Live
Measured
Vanilla 0.346 · WL 0.916 ↑ 2.65×. GEval Faithfulness, n=5, 5/5 PASS. Same model, same prompts, two arms.
What
Fraction of agent responses where every cited specific (date, name, amount, reference) traces back to retrieved memory. Fabricated specifics — even when the prose reads confident — count as ungrounded.
Judge
DeepEval GEval Faithfulness rubric with retrieval-context-aware criteria. Each cited specific must be verifiable against the memories the agent retrieved for that turn.
Setup
Two arms: vanilla LLM vs. Wisdom Layer agent with populated memory. Both arms see identical prompts. Mode-aware judging so memory-grounded specifics aren’t penalized as confident hallucinations.
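
To make the setup concrete, here is a minimal sketch of how a retrieval-context-aware faithfulness check can be wired with DeepEval's GEval metric. The criteria wording, threshold, and response field names are illustrative assumptions, not the exact judge prompt or harness code from this run.

    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams

    # Retrieval-context-aware faithfulness rubric. Criteria text and threshold
    # are illustrative, not the judge prompt used in the published run.
    faithfulness = GEval(
        name="Faithfulness (memory-grounded)",
        criteria=(
            "Every cited specific (date, name, amount, reference) in the actual "
            "output must be verifiable against the retrieval context. Confident "
            "but fabricated specifics count as ungrounded."
        ),
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.RETRIEVAL_CONTEXT,
        ],
        model="gpt-4o",
        threshold=0.7,
    )

    def score_arm(responses):
        """Mean GEval score for one arm (vanilla or Wisdom Layer).

        responses: dicts with the prompt, the agent's answer, and the list of
        memories retrieved for that turn (field names are assumptions).
        """
        scores = []
        for r in responses:
            case = LLMTestCase(
                input=r["prompt"],
                actual_output=r["answer"],
                retrieval_context=r["memories"],
            )
            faithfulness.measure(case)
            scores.append(faithfulness.score)
        return sum(scores) / len(scores)
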
Atomic-fact recall across sessions Live
Measured
Vanilla 0 / 10 · mem0 0 / 10 · WL 10 / 10 ↑ perfect. Recall@5 on 10 hand-crafted (subject, attribute, value) probes written in a prior session. Judge-free: case-insensitive substring match against the top-5 retrieved memories. Basic Memory ties at 10 / 10; mem0 and Vanilla both score 0 / 10 on the same probe set.
What
Whether facts written into memory in session N are retrievable and applied correctly in session N+1. The honest test of whether an agent gets better with experience instead of starting from zero every time.
Judge
None. Binary substring match against the retrieved memory set — no judge model, no rubric, no scoring fudge. Either the value is in the top-5 or it isn’t.
Setup
Longitudinal harness: write a known set of atomic facts in session 1, then probe for them in session 2 with no in-context history. Vanilla cannot persist; mem0, Basic Memory, and Wisdom Layer each retrieve through their own pipelines. Same 10 probes against all four arms.
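
The scoring rule is simple enough to show in full. A minimal sketch, assuming a retrieve(query) callable as each memory system's search entry point and a query built from the probe's subject and attribute; both details are assumptions about the harness, not its published code.

    def recall_at_5(probes, retrieve):
        """Judge-free Recall@5 over (subject, attribute, value) probes.

        probes: facts written into memory in session 1.
        retrieve: callable(query) -> ranked memory strings in session 2.
        A probe passes if its value appears, case-insensitively, in any of
        the top-5 retrieved memories.
        """
        hits = 0
        for subject, attribute, value in probes:
            query = f"{subject} {attribute}"      # probe phrasing is an assumption
            top5 = retrieve(query)[:5]
            if any(value.lower() in memory.lower() for memory in top5):
                hits += 1
        return hits, len(probes)

    # hits, total = recall_at_5(probes, wisdom_layer.search)  # hypothetical entry
    # point; 10, 10 in the reported Wisdom Layer run
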
Self-correction (errors caught before output) Live
Measured
WL Critic 3 / 3 PASS ↑ all caught. Single-arm directive-adherence probes. Dream-cycle directive actionability scored 0.890 on a separate synthesis test.
What
Rate at which the Critic flags drafts that violate active directives or contradict facts in the agent’s grounded memory — before the response is ever shown to the user.
Judge
DeepEval GEval directive-adherence rubric against a corpus of intentionally directive-violating drafts. Score = fraction caught and corrected by the Critic before final output.
Setup
Pro-tier agent with active directives + grounding verifier enabled. Each probe injects a directive, then poses a prompt designed to tempt a directive violation in the draft.
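
A hedged sketch of the catch-rate arithmetic, assuming a hypothetical run_agent helper that exposes the Critic's verdict on the draft and whether the final output adheres; in the real harness, adherence is scored by the GEval rubric above rather than a boolean field.

    def critic_catch_rate(probes, run_agent):
        """Fraction of directive-violating drafts caught before final output.

        probes: each carries an injected directive plus a prompt crafted to
        tempt a violation in the draft.
        run_agent: hypothetical helper returning the Critic's verdict on the
        draft and whether the final output adheres to the directive.
        """
        caught = 0
        for probe in probes:
            result = run_agent(directive=probe["directive"], prompt=probe["prompt"])
            if result["critic_flagged"] and result["final_adheres"]:
                caught += 1
        return caught / len(probes)   # 3 / 3 -> 1.0 in the reported run
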
Stale info repeated after correction Live
Measured
WL last-write-wins 1.000 · drift ↓ to zero. Single-arm: a correction event is injected, then the same fact is probed again. The agent applied the corrected value 100% of the time.
What
How often the agent repeats a stale fact after it has been explicitly corrected within the session history. The thing every memory layer claims to fix and few measure with a forced overwrite probe.
Judge
DeepEval GEval last-write-wins rubric over a labeled corpus of correction events. A repeat occurrence of the pre-correction value counts as drift.
Setup
Single-session harness with an explicit correction event mid-conversation, followed by a probe that tempts the agent to recall the original (now stale) value.
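
A simplified, string-level version of the drift check, assuming each correction event records the stale value, the corrected value, and the follow-up probe. The published number comes from the GEval last-write-wins rubric; this substring shortcut only illustrates what counts as drift.

    def drift_rate(correction_events, ask):
        """Fraction of post-correction probes that repeat the stale value.

        correction_events: each records the pre-correction (stale) value, the
        corrected value, and a follow-up probe posed after the mid-session
        correction.
        ask: callable(probe) -> agent response with full session history.
        """
        drifted = 0
        for event in correction_events:
            response = ask(event["probe"]).lower()
            if (event["stale_value"].lower() in response
                    and event["corrected_value"].lower() not in response):
                drifted += 1
        return drifted / len(correction_events)   # 0.0 means no stale repeats
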
Independent quality audit (composite) Live
Measured
Vanilla 5.50 · mem0 6.00 · Basic 6.17 · WL 7.79 ↑ +1.79 vs mem0. Composite mean across four pre-committed dimensions, scored 0–10 by an independent judge after the four-arm GEval run. Grounding Honesty dimension: Vanilla 7.67, mem0 5.50, Basic 6.83, Wisdom 9.17; mem0's low score is driven by cross-customer fabrication (e.g., inventing "8 years" tenure when the probe stated 5).
What
A second-judge audit of the same 24 responses (6 probes × 4 arms) the primary GEval run scored. Designed to surface fabrication that GEval’s specificity-only rubric can’t detect — e.g., a confident invention scores high on GEval Groundedness even when the cited specifics are wrong.
Judge
Claude Opus 4.7, integer 0–10 scoring on four dimensions (Customer-Helpfulness, Grounding Honesty, Behavioral Consistency, Pattern Application). Criteria committed before reading any response — the audit rubric is locked before the data is seen.
Setup
Same 24 responses from the four-arm run (Vanilla, mem0, Basic Memory, Wisdom Layer) graded blind by a different judge with the locked rubric. See full per-dimension table and audit methodology in benchmarks/independent_audit.md.
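
For clarity on how the headline number is aggregated, here is a minimal sketch of the composite mean, assuming per-response integer scores keyed by the four dimension names. The dimension keys and arm labels are illustrative, not the audit script's identifiers.

    DIMENSIONS = ["customer_helpfulness", "grounding_honesty",
                  "behavioral_consistency", "pattern_application"]

    def composite_per_arm(audit_scores):
        """Composite = mean over all responses and all four dimensions, per arm.

        audit_scores: arm name -> list of per-response dicts holding an
        integer 0-10 score for each dimension (keys above are illustrative).
        """
        per_arm = {}
        for arm, rows in audit_scores.items():
            values = [row[d] for row in rows for d in DIMENSIONS]
            per_arm[arm] = sum(values) / len(values)
        # e.g. {"vanilla": 5.50, "mem0": 6.00, "basic": 6.17, "wisdom": 7.79}
        return per_arm
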

What’s coming

Methodology now — eval repository soon

The numbers above are from the v1.0 Beta run on 2026-04-26. The methodology, judge prompts, and run metadata are the public record today. The eval harness, raw transcripts, judge configs, and expanded benchmark suites publish in a follow-up release — alongside long-horizon agent probes, multi-domain coverage, and side-by-side framework comparisons. When the eval repo lands, every number on this page will ship with the seed, model version, and judge prompts that produced it.

For the earlier single-corpus fabrication-reduction write-up that informed the hallucination / groundedness metric design above, see the fabrication eval document in the public SDK repo.
