Benchmarks & Methodology
How Wisdom Layer is measured: the metrics, the judges, the setup, and, by design, the probes where we scored lower than alternatives. We publish what we measured and the prompts the judges used. The full eval repository and expanded benchmark suites land in a follow-up release.
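To make the judge setup concrete, here is a minimal sketch of a single-answer judge call in the spirit described above. It is illustrative only: the prompt wording, the 1-5 scale, and the call_judge_model stub are assumptions for this page, not the published judge configs.

```python
# A minimal sketch of an LLM-as-judge scoring call, not the actual harness.
# All names (JUDGE_PROMPT, call_judge_model, judge_answer) are illustrative.

JUDGE_PROMPT = """You are a strict grader. Given a QUESTION, a SOURCE passage,
and an ANSWER, reply with a single integer from 1 to 5 for how well the
answer is grounded in the source. Reply with the number only.

QUESTION: {question}
SOURCE: {source}
ANSWER: {answer}"""


def call_judge_model(prompt: str) -> str:
    """Placeholder for a real model call; swap in your provider's client."""
    raise NotImplementedError("wire this to an actual LLM API")


def judge_answer(question: str, source: str, answer: str) -> int:
    """Format the judge prompt, call the judge, and parse the 1-5 score."""
    reply = call_judge_model(
        JUDGE_PROMPT.format(question=question, source=source, answer=answer)
    )
    score = int(reply.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {reply!r}")
    return score
```

A prompt that demands a bare integer keeps parsing trivial and leaves the judge's raw reply auditable next to the transcript.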
Methodology now — eval repository soon
The numbers above are from the v1.0 Beta run on 2026-04-26.
The methodology, judge prompts, and run metadata form the public record today. The eval harness, raw transcripts, judge configs, and expanded benchmark suites will be published in a follow-up release, alongside long-horizon agent probes, multi-domain coverage, and side-by-side framework comparisons. When the eval repo lands, every number on this page will ship with the seed, model version, and judge prompts that produced it.
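As a sketch of what that per-number record could look like, the snippet below bundles a metric value with the seed, the model version, and a hash of the judge prompt. The RunRecord shape and every field name are assumptions for illustration; the eval repo may define its own schema.

```python
# A hypothetical shape for the per-metric run record promised above.
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class RunRecord:
    metric: str                # e.g. "groundedness"
    value: float               # the published number
    seed: int                  # RNG seed for the run
    model_version: str         # exact model snapshot evaluated
    judge_prompt_sha256: str   # hash of the judge prompt used
    run_date: str              # ISO date of the run


def record(metric: str, value: float, seed: int,
           model_version: str, judge_prompt: str, run_date: str) -> str:
    """Serialize one published number with what is needed to reproduce it."""
    rec = RunRecord(
        metric=metric,
        value=value,
        seed=seed,
        model_version=model_version,
        judge_prompt_sha256=hashlib.sha256(judge_prompt.encode()).hexdigest(),
        run_date=run_date,
    )
    return json.dumps(asdict(rec), indent=2)


# Placeholder values for illustration only; not published results.
print(record("groundedness", 0.0, seed=0,
             model_version="wisdom-layer-v1.0-beta",
             judge_prompt="<judge prompt text>", run_date="2026-04-26"))
```

Hashing the judge prompt rather than inlining it keeps the record compact while still letting anyone verify which prompt produced a given number.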
For the earlier single-corpus fabrication-reduction write-up that informed the
hallucination / groundedness metric design above, see the
fabrication eval document
in the public SDK repo.
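For readers who want the gist without opening that document, a toy version of a groundedness score might look like the following: the fraction of extracted claims a judge marks as supported by the source. The actual metric definition is the one in the fabrication eval write-up; the claim extraction, the supported() judge, and the empty-answer convention here are all assumptions.

```python
# An illustrative groundedness score, not the published metric definition.
from typing import Callable, Sequence


def groundedness(claims: Sequence[str],
                 supported: Callable[[str], bool]) -> float:
    """Fraction of claims the judge deems supported by the source.

    Returning 1.0 for an empty answer is a design choice in this sketch:
    no claims means nothing fabricated.
    """
    if not claims:
        return 1.0
    return sum(supported(c) for c in claims) / len(claims)
```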