Reproducibility commitment

We publish the methodology, the code, and the failures.

In April 2026, the agent-memory space watched the MemPalace project ship a 100% LongMemEval score that a community code review then showed was measuring ChromaDB defaults, not the advertised architecture. We learned from that, in public. Here's how we operate.

We publish the runner code

Every benchmark on the landing page links to the script that produced it. Same script, your API key, your numbers. No proprietary harness.

Browse bench/

We publish the raw judge logs

When a benchmark uses LLM-as-judge, the full transcripts and verdicts ship in benchmarks/results/. You can audit every call, and the pass/fail verdict for each question is visible.

Browse benchmarks/results/
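
To illustrate what auditing a judge log could look like, here is a minimal sketch that recomputes the pass rate from per-question verdicts. The record shape (one JSON object per line with `question_id` and `verdict` fields) is an assumption made for this example, not the documented format of the files in benchmarks/results/; adjust the field names to whatever the logs actually contain.

```python
import json
from typing import Dict, Iterable


def tally_verdicts(lines: Iterable[str]) -> Dict[str, object]:
    """Recompute per-question results and the overall pass rate.

    Assumed (hypothetical) record shape, one JSON object per line:
        {"question_id": "q1", "verdict": "pass"}
    """
    per_question = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        record = json.loads(raw)
        # If a question appears twice, the last verdict wins.
        per_question[record["question_id"]] = record["verdict"] == "pass"

    total = len(per_question)
    passed = sum(per_question.values())
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failed_ids": sorted(q for q, ok in per_question.items() if not ok),
    }


# Invented sample records, only to show the call; not real results.
sample_log = [
    '{"question_id": "q1", "verdict": "pass"}',
    '{"question_id": "q2", "verdict": "fail"}',
    '{"question_id": "q3", "verdict": "pass"}',
]
summary = tally_verdicts(sample_log)
print(summary)
```

With a real log you would pass an open file handle instead of `sample_log`, then compare the recomputed `pass_rate` against the number quoted on the page.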

We publish what failed

An older +46% cost-economy claim couldn't be reproduced on the rerun. We report the contradiction openly in the results folder rather than cite the favorable number. Failure transparency matters more than hero numbers.

See the rerun reports

We don't claim what we haven't run

LongMemEval and LoCoMo are industry-standard benchmarks. We haven't published our scores yet, and we'd rather owe you the number than ship a partial one. When we run them, raw output goes in the repo the same day.

What to be suspicious of

Five red flags in agent-memory benchmarks. Apply them to us too.

  • 100% accuracy on a public benchmark — usually means the eval was bypassed
  • Perfect-score claim shipped without runner code in repo
  • Marketing pages with metrics absent from the project's own BENCHMARKS.md
  • Headline metric that secretly measures the underlying vector store, not the system
  • Per-question fix patches counted as 'architectural improvements'
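
The third flag in the list above can be checked mechanically, at least roughly. The sketch below is a crude heuristic, not a tool from this repo: it pulls percentage figures out of a marketing text and reports any that do not appear in the text of a BENCHMARKS.md. The function name, the regular expression, and the sample strings are all assumptions made for illustration.

```python
import re
from typing import List

# Matches figures such as "46%", "+46%", or "91.5%".
PERCENT = re.compile(r"[+-]?\d+(?:\.\d+)?%")


def unbacked_metrics(marketing_text: str, benchmarks_text: str) -> List[str]:
    """Return percentage figures claimed in marketing but absent from benchmarks."""
    claimed = set(PERCENT.findall(marketing_text))
    documented = set(PERCENT.findall(benchmarks_text))
    return sorted(claimed - documented)


# Invented sample strings, only to show the call; not real results.
marketing = "Recall: 80.0%. Latency cut by 25%."
benchmarks = "| recall | 80.0% |"
missing = unbacked_metrics(marketing, benchmarks)
print(missing)
```

A figure flagged this way is only a prompt to go look: exact string matching misses rounding differences and reworded metrics, so a human still has to read both documents.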

If you find one of these in our repo, open a GitHub issue. We update or retract within 48 hours.