One URL.
Every AI.
100% verified.
Memory.Wiki delivers your knowledge to Claude, ChatGPT, and Gemini through a single URL. MWBench is the open eval that measures whether the wedge actually works, including on content the AI has never seen during training.
Headline result
| Mode | raymindai familiar hub | mwbench-zorblax synthetic, unseen |
|---|---|---|
| Paste mode / full corpus AI receives every doc body in the prompt | 100% | 100% |
| Paste mode / compact 8–9× smaller payload (concept digest + skeleton) | 100% | 100% |
| Browse mode (AI fetches the URL) The real user scenario | 98% | 100% |
| Adversarial refusal AI correctly refuses when corpus lacks the answer | 100% | n/a |
| Tool-use rate Did the AI actually fetch the URL when handed one | 100% | 100% |
Three runners: claude-sonnet-4-6, gpt-5.5, gemini-3.5-flash. Judge: quote-evidence, requires a literal corpus quote per claim.
Two independent axes
Browse vs Paste
Paste: the bench tool fetches the URL itself and includes the body in the prompt. The AI reads what is in front of it. Internal sanity check.
Browse: the AI gets only the URL plus a fetch_url tool. It decides to fetch, follows links inside the hub, then answers. This is what happens when a user pastes a Memory.Wiki URL into Claude.ai or ChatGPT.
Familiar vs Unseen
Familiar (raymindai): a public hub that may have been crawled by AI training data. Some of the accuracy could be memorization.
Unseen (mwbench-zorblax): a synthetic hub seeded for this test. Every fact is fictional (ZorblaxCorp, CipherPlate v3.4.1, Talia Renford), none exist anywhere in AI training data. Only the URL fetch can produce correct answers.
We don't bench every hub
The cross-AI wedge is proven at the system level, not per-hub. The unseen-hub result (100% on content the AIs have never seen) means every hub built on Memory.Wiki inherits the same property automatically. Re-running the bench on every customer hub would be repeating a proof we've already given.
The harness, the data, and the deeper write-ups are below for anyone who wants to audit the claim or run it themselves.
Methodology
Three runners. Each query runs through claude-sonnet-4-6 (1M context), gpt-5.5, and gemini-3.5-flash. Same prompt template, same tool spec for browse mode (fetch_url), independent API calls.
Quote-evidence judge. The judge model (claude-sonnet-4-6) is given the runner's full corpus and must produce a literal quote from that corpus for every substantive claim in the answer. Score = supported share of claims. No “this sounds like hallucination” guesswork. Every percentage point is auditable.
Cross-doc synthesis is allowed. A claim is grounded if it appears anywhere in the runner's corpus, not just in the doc the query targets. Mirrors how real users ask multi-doc questions.
Adversarial subset. 5 queries ask for facts that are NOT in the corpus (someone's home address, an unannounced acquisition, etc.). Empty answer is treated as implicit refusal. Catches the classic “AI made something up rather than admitting it didn't know” failure mode.
Reproducible. Harness is at github.com/raymindai/memory-wiki /eval. Re-run any round with node eval/run-bench.mjs or node eval/run-browse-bench.mjs.
The full write-ups
fetch_url tool. Discovers and fetches and answers, all without the corpus in the prompt.Try it yourself
Sign up at Memory.Wiki, capture five docs from any AI chat, and paste your hub URL into Claude.ai or ChatGPT. The AI will fetch, read, and answer, even on content it has never seen during training.
Start free