MWBench

One URL.
Every AI.
100% verified.

Memory.Wiki delivers your knowledge to Claude, ChatGPT, and Gemini through a single URL. MWBench is the open eval that measures whether the wedge actually works, including on content the AI has never seen during training.

Headline result

Mode | raymindai (familiar hub) | mwbench-zorblax (synthetic, unseen)
Paste mode / full corpus (AI receives every doc body in the prompt) | 100% | 100%
Paste mode / compact (8–9× smaller payload: concept digest + skeleton) | 100% | 100%
Browse mode, AI fetches the URL (the real user scenario) | 98% | 100%
Adversarial refusal (AI correctly refuses when corpus lacks the answer) | 100% | n/a
Tool-use rate (did the AI actually fetch the URL when handed one) | 100% | 100%

Three runners: claude-sonnet-4-6, gpt-5.5, gemini-3.5-flash. Judge: quote-evidence, requires a literal corpus quote per claim.

Two independent axes

Axis 1

Browse vs Paste

Paste: the bench tool fetches the URL itself and includes the body in the prompt. The AI reads what is in front of it. Internal sanity check.

Browse: the AI gets only the URL plus a fetch_url tool. It decides to fetch, follows links inside the hub, then answers. This is what happens when a user pastes a Memory.Wiki URL into Claude.ai or ChatGPT.
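A minimal sketch of how the two modes differ in what the model receives. Only the tool name fetch_url comes from the description above. The request shape, the tool's parameter schema, and the function names are illustrative assumptions, not the harness's actual code or any vendor's SDK.

```javascript
// Hypothetical tool spec for browse mode. The name "fetch_url" is from the
// text above; the description and parameter schema are assumed.
const FETCH_URL_TOOL = {
  name: "fetch_url",
  description: "Fetch a URL and return its body as text.",
  parameters: {
    type: "object",
    properties: { url: { type: "string" } },
    required: ["url"],
  },
};

// Paste mode: the harness has already fetched the hub, so the body is
// placed directly in the prompt and no tools are offered.
function buildPasteRequest(question, hubBody) {
  return {
    prompt: `${hubBody}\n\nQuestion: ${question}`,
    tools: [],
  };
}

// Browse mode: the model sees only the URL and must decide on its own
// to call fetch_url before answering.
function buildBrowseRequest(question, hubUrl) {
  return {
    prompt: `Hub: ${hubUrl}\n\nQuestion: ${question}`,
    tools: [FETCH_URL_TOOL],
  };
}
```

The point of the split is that a paste request can succeed without any tool call, while a browse request on unseen content can only succeed if the model actually fetches.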

Axis 2

Familiar vs Unseen

Familiar (raymindai): a public hub that may have been crawled for AI training data, so some of the accuracy could be memorization.

Unseen (mwbench-zorblax): a synthetic hub seeded for this test. Every fact is fictional (ZorblaxCorp, CipherPlate v3.4.1, Talia Renford) and none of it exists anywhere in AI training data, so only the URL fetch can produce correct answers.

Browse × Unseen is the only cell that fully isolates the wedge. The AI must fetch (Browse), and memorization is impossible (Unseen). 100% across all three runners (Claude, GPT, and Gemini) means the cross-AI URL delivery model genuinely works, not just on content the AI happened to memorize.

We don't bench every hub

The cross-AI wedge is proven at the system level, not per-hub. The unseen-hub result (100% on content the AIs have never seen) means every hub built on Memory.Wiki inherits the same property automatically. Re-running the bench on every customer hub would be repeating a proof we've already given.

The harness, the data, and the deeper write-ups are below for anyone who wants to audit the claim or run it themselves.

Methodology

Three runners. Each query runs through claude-sonnet-4-6 (1M context), gpt-5.5, and gemini-3.5-flash. Same prompt template, same tool spec for browse mode (fetch_url), independent API calls.

Quote-evidence judge. The judge model (claude-sonnet-4-6) is given the runner's full corpus and must produce a literal quote from that corpus for every substantive claim in the answer. Score = supported share of claims. No “this sounds like hallucination” guesswork. Every percentage point is auditable.

Cross-doc synthesis is allowed. A claim is grounded if it appears anywhere in the runner's corpus, not just in the doc the query targets. Mirrors how real users ask multi-doc questions.
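The scoring rule and the cross-doc allowance above can be sketched as follows. This is an assumed shape, not the harness's implementation: the claim and doc object fields, the function names, and the whitespace normalization step are all hypothetical.

```javascript
// Collapse whitespace so line wrapping does not break an otherwise literal
// match. (This normalization is an assumption, not stated in the source.)
function normalize(text) {
  return text.replace(/\s+/g, " ").trim();
}

// A claim counts as supported if its quote appears verbatim in ANY doc of
// the runner's corpus, not only the doc the query targets.
function isSupported(claim, corpus) {
  const quote = normalize(claim.quote || "");
  if (quote === "") return false; // no quote, no support
  return corpus.some((doc) => normalize(doc.body).includes(quote));
}

// Score = supported share of claims. Returns null when there are no claims,
// leaving that case to the caller.
function scoreAnswer(claims, corpus) {
  if (claims.length === 0) return null;
  const supported = claims.filter((c) => isSupported(c, corpus)).length;
  return supported / claims.length;
}

// Usage with made-up sample docs (not from either benchmark hub):
const sampleCorpus = [
  { id: "doc-a", body: "The release shipped on a Tuesday." },
  { id: "doc-b", body: "The team has four engineers." },
];
const sampleClaims = [
  { text: "It shipped on Tuesday.", quote: "shipped on a Tuesday" },
  { text: "Four engineers work on it.", quote: "four engineers" },
  { text: "It won an award.", quote: "won an award" },
];
const sampleScore = scoreAnswer(sampleClaims, sampleCorpus);
```

In the sample, the first two quotes are found in different docs and the third is found nowhere, so two of three claims are supported.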

Adversarial subset. Five queries ask for facts that are NOT in the corpus (someone's home address, an unannounced acquisition, etc.). An empty answer is treated as an implicit refusal. This catches the classic “AI made something up rather than admitting it didn't know” failure mode.
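A rough sketch of how refusals on the adversarial subset could be tallied. The only rule taken from the text above is that an empty answer counts as an implicit refusal. The phrase list and function names are hypothetical, and simple phrase matching is a stand-in for however the harness actually detects an explicit refusal.

```javascript
// Illustrative phrases that signal an explicit refusal. (Assumed list.)
const REFUSAL_MARKERS = [
  "not in the corpus",
  "cannot find",
  "can't find",
  "no information",
];

// True when the answer is empty (implicit refusal) or contains one of the
// marker phrases (explicit refusal).
function isRefusal(answer) {
  const text = (answer || "").trim().toLowerCase();
  if (text === "") return true;
  return REFUSAL_MARKERS.some((marker) => text.includes(marker));
}

// Share of adversarial answers that were refusals; null if none were given.
function refusalRate(answers) {
  if (answers.length === 0) return null;
  return answers.filter(isRefusal).length / answers.length;
}
```

Phrase matching is deliberately crude: an answer that says it cannot find the fact and then invents one anyway would still pass this check, so a stricter judge would also need to confirm that no unsupported claim follows.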

Reproducible. The harness lives in the /eval directory of github.com/raymindai/memory-wiki. Re-run any round with node eval/run-bench.mjs or node eval/run-browse-bench.mjs.

The full write-ups

Try it yourself

Sign up at Memory.Wiki, capture five docs from any AI chat, and paste your hub URL into Claude.ai or ChatGPT. The AI will fetch, read, and answer, even on content it has never seen during training.

Start free