Open evaluation · v8 · 2026-05

Benchmark/Eval

The memory.wiki URL is supposed to be the single thing you paste into every AI. This page is the open evaluation that proves it — including against hubs the model could not possibly have seen during training. Harness, judge, and round-by-round results are all public.

100%

Paste mode / unseen hub

98%

Browse mode / familiar hub

5×

Token savings / compact mode

Hallucinated answers / adversarial

1. The hypothesis

The memory.wiki URL contract claims a single address can deliver your knowledge to any AI. That’s a strong claim. There are at least four ways it could be wrong:

Recall failure — the AI fetches the URL but ignores parts of the content.
Fetch failure — the AI can't reliably retrieve the URL (network, tool limits, sandbox).
Memorisation artifact — the model has seen the hub during training, so its answers reflect prior memory, not the URL delivery.
Hallucination under pressure — when the URL doesn't cover the question, the AI invents an answer instead of refusing.

The benchmark is designed to falsify each failure mode independently. If all four pass, the URL contract holds.

2. Method

Two hubs are built. The familiar hub uses content known to predate every subject AI's training cutoff — the model may have seen it. The unseen hub is brand new content, published after every subject AI's cutoff. If the unseen hub scores 100%, that's the URL contract working, not memory recall.

Each hub is exercised in four modes per subject AI:

Paste, full corpus. Inline the entire hub into the prompt, ask one question.

Paste, compact. Same as M1 but with ?compact — whitespace stripped, quote blocks dropped. 5–9× fewer tokens.

Browse. Instruct the AI to fetch the hub URL itself via its native browsing tool.

Adversarial refusal. Ask a question the hub does NOT cover. The correct behaviour is refusal, not fabrication.

Each subject AI · each mode · both hubs · ≥30 questions per cell.

Answers are scored by a separate evaluator model with a fixed rubric. The judge prompt is shown below — it lives in the repo and is not changed between runs:

markdown

You are evaluating whether an AI answer is grounded in a
supplied knowledge source.

Score on three axes:
- Faithfulness    (0-3): is every claim supported by the source?
- Coverage        (0-3): does the answer address the question fully?
- Refusal-quality (0-3): if asked about something outside the source,
                         does the AI refuse rather than fabricate?

Output JSON: {"faithfulness": n, "coverage": n, "refusal": n, "notes": "..."}

3. Results

Aggregated across all subject AIs. A cell scores 100% when every question in that cell passed the rubric (faithfulness ≥ 2, coverage ≥ 2, no fabrications).

Mode	Familiar hub	Unseen hub	Tool use
Paste, full corpus	100%	100%	100%
Paste, compact (5 to 9× cheaper)	100%	100%	100%
Browse (AI fetches the URL)	98%	100%	100%
Adversarial refusal	100%	not run	100%

Adversarial refusal on the unseen hub is marked “not run” because the unseen hub was published in May 2026 and a refusal adversarial requires a topic the AI both knows about and the hub does not cover — we re-run this column next quarter once enough out-of-corpus topics accumulate.

The two reads to draw from this table:

Unseen-hub paste = 100%. The model never trained on this content, yet pasting the URL contents delivers full knowledge transfer. The URL contract is doing the work, not memorisation.
Browse-mode familiar hub = 98%. The 2% gap is occasional fetch failure (AI tool sandbox hiccups), not contract failure. When the AI successfully fetches, faithfulness is 100%.

4. Adversarial

The hardest failure mode is fabrication under pressure: you point an AI at a URL and ask it something outside the URL’s scope. The lazy answer is to make something up. The faithful answer is to refuse.

Across the adversarial set on the familiar hub, the subject AIs scored 100% refusal — every off-corpus question got a “the source does not cover this” response rather than a fabricated answer. The judge prompt scores refusal-quality separately from faithfulness so this isn’t double-counted.

Worth noting: this score is partially a function of the URL frontmatter telling the AI exactly what scope it just received. The AI knows whether it’s looking at one doc, a curated bundle, or a hub manifest, and answers within that scope.

5. Reproduce it

Everything needed to run the eval yourself is open. The harness, the two hub fixtures, the judge prompt, and the result aggregator all live in the public repo. You can:

Run the existing harness against your own AI of choice (any model with a chat API).
Add your own modes or questions and re-aggregate.
Swap the judge prompt and re-score; the round-by-round JSON is preserved.

The outline:

markdown

# Eval harness — outline
1. Build a hub from N markdown docs (familiar = pre-cutoff; unseen = post-cutoff).
2. Generate Q&A ground truth from those docs with a separate authoring model.
3. For each subject AI (Claude, ChatGPT, Gemini, Cursor, Codex):
   - Mode 1: paste the full hub as context, ask the question.
   - Mode 2: paste ?compact, ask.
   - Mode 3: instruct the AI to fetch the hub URL itself, ask.
   - Mode 4 (adversarial): ask a question the hub does NOT cover.
4. Judge each answer with a separate evaluator model + a rubric.
5. Aggregate per (subject, mode, hub).

Repo: github.com/raymindai/memory-wiki · folder /evals/cross-ai/.

6. Honest limits

This benchmark is narrow on purpose. It tests one thing: whether the URL contract delivers content faithfully across AIs. It does NOT test:

Quality of the AI's reasoning over the content (that's the AI's job, not the URL's).
Long-tail reliability of every model's fetch tool — we report what we measured; your mileage may vary in production.
Hubs larger than 80k tokens — for those, compact mode + selective bundle fetching is the recommended pattern; numbers above don't cover whole-corpus inlining at extreme scale.
Cross-language hubs — the v8 numbers above are English-only. Korean and bilingual hubs are in next quarter's run.

The plan is to publish a fresh run every two months as models update and new failure modes show up. Repo issues are open for proposed eval additions.

7. The bottom line

If you paste a memory.wiki URL into Claude, ChatGPT, Gemini, Cursor, or Codex, the AI will read it faithfully. The same is true for content the AI cannot possibly have seen during training. The URL is the interface; the model is the reader; the contract holds.

See it in motion: how memory.wiki actually works. Try it: paste any markdown and get a URL in three seconds.