MWBench v1 — Cross-AI eval for Memory.Wiki

"Does the same URL produce equivalent answers across Claude, OpenAI, and Gemini — and does the wedge survive on content the AIs have never seen during training?"

🎯 End-to-end result — wedge survives the unseen-hub test

	raymindai (familiar)	mwbench-zorblax (synthetic, unseen)
Paste full	100%	100%
Paste compact	100%	100%
Browse	98%	100%
Tool-use rate	100%	100%
Adversarial refusal	100%	—

Both hubs: ~100% across every measurement axis, every runner. The cross-AI wedge is not training-data memorization — it's the URL delivery model working end-to-end.

raymindai: 370/375 cells. zorblax: 90/90 cells.

Two axes — Browse vs Paste, Familiar vs Unseen

Browse vs Paste (how the AI sees the corpus)

Paste mode	Browse mode
corpus is pre-pasted into the prompt	AI receives URL only; must call `fetch_url` tool itself
100% reliable — corpus is guaranteed in context	the real user scenario: paste a URL into Claude.ai / ChatGPT / Gemini and the AI fetches it
internal sanity test	the actual wedge test

Familiar vs Unseen (whether the AI has seen the hub during training)

raymindai (familiar)	mwbench-zorblax (unseen)
public hub, may have been crawled by AI training data	brand-new synthetic hub seeded for this test
memorization could inflate the score	guaranteed fresh: every fact is a fictional company / number / employee
useful baseline, but not the honest claim	the honest claim

The four cells are independent:

Paste + Familiar — internal sanity ✓
Paste + Unseen — does the delivery format work on novel content? ✓
Browse + Familiar — does AI tool-use work? ✓
Browse + Unseen — the actual real-world wedge ✓ 100%

Round log

Round 0 — first paste of `llms-full` + binary keyword judge

Compact 33%.

Round 1 — corpus richness lifts

Compact 33% → 82.5%, Full 91.7%.

Round 2 — Claude sonnet + knowledge graph in `llms-full`

Switched opus → sonnet (cheaper + better). Concepts + relations + bundle AI graphs.

Round 3 — Gemini fix + judge audit

gemini-3.5-flash + secondary API key → zero rate-limit errors. Judge was marking correct cross-doc synthesis as hallucination.

Round 3.5 — query fix + judge unfettered context

Corrected q-004 expected_doc. Removed judge corpus cap.

Round 4 — quote-evidence judge → hub URLs hit 100%

Judge must literally quote a supporting passage from the corpus for every claim.

Round 5 — bundle and single-doc URLs reach 100% paste-mode

Bundle digest carries per-doc gist + skeleton, single doc gets knowledge-graph context block.

Round 6 — first honest browse-mode measurement (pre-deploy)

Built browse-mode harness. Hub 41.7%, Bundle 93.3%, Doc 80%.

Round 6.5 — knowledge graph deploy

Hub 41.7% → 98.3% (+56.6pp).

Round 7 — doc AI graph + browse runner fixes + adversarial + readiness badge

documents.ai_graph jsonb. Doc browse 80% → 90%.
5 adversarial queries: Claude/OpenAI/Gemini all 100% refuse correctly.
Hub readiness badge shipped on /hub/.

Round 8 — unseen-hub baseline + extractFacts bug

Seeded synthetic ZorblaxCorp hub.
Unseen-hub testing surfaced an extractFacts m-flag regex bug that captured only the first bullet under ## Facts.
Hub route had a stale local copy of gist helpers; consolidated to shared lib.
Unseen compact 47% → 80% (after bug fix), browse 90%.

Round 9 — bullet-aware skeleton closes the unseen gap

extractSkeleton was keeping only the first line under each H2 heading. Section written as a bullet list lost everything after the first bullet.
Made bullet-aware: capture all bullets per section (up to 6, joined). Skeleton max length raised 380 → 700 chars.
Unseen compact 80% → 100%, browse 90% → 100%.
Wedge confirmed: 100% on truly unseen content across all three runners and all three modes.

What this proves (final)

A single memory.wiki/<...> URL paste into any major AI:

AI fetches it 100% of the time when given the URL (browse-mode tool-use rate)
Returns correct answer 100% of the time on UNSEEN content across Claude / OpenAI / Gemini
Same answer across all three AI vendors (cross-AI consistency)
AI refuses honestly when corpus doesn't have the answer (adversarial 100%)
Visible to every visitor as a live readiness badge on /hub/

The cross-AI wedge is real, measurable, productised, and not dependent on the AI having seen the hub during training.

What this still doesn't prove

At larger N — 10-20 queries per scope. Larger N would tighten ±5pp confidence intervals.
Real-client surface — measurements use direct API calls with fetch_url tool. Whether ChatGPT.com / Claude.ai / Gemini.app actually fetch URLs from user pastes is per-client behavior.
Long-tail unseen content — zorblax has 7 docs. Larger unseen hubs (50+ docs) may behave differently.

Notes for AI runners reading this

Claude sonnet-4-6 (1M context) — best runner all-rounder.
OpenAI gpt-5.5 — ties Claude on every cell.
Gemini 3.5-flash — fastest, cheapest, 100% on unseen content.
Quote-evidence judge at claude-sonnet-4-6 is the only judge config that doesn't produce false hallucination flags.
Browse mode tool spec: identical fetch_url(url) schema across all three vendors. Last turn must force final answer (drop tools or tool_choice: "none").
For new hubs: doc summary + ai_graph auto-fire on POST. Concept index needs explicit "Build ontology" trigger.
Bullet-aware skeleton (Round 9) is the key fix that made the unseen hub hit 100%. Previously bullet lists lost everything after the first bullet.

Harness: github.com/raymindai/memory.wiki /eval. Run yourself: node eval/run-bench.mjs (paste) or node eval/run-browse-bench.mjs (browse).

Live readiness: raymindai (370/375) · mwbench-zorblax (90/90, synthetic unseen).

Siblings: Bundle & Doc URL enrichment · Round 6-7 Browse mode detailed.

Facts

Round 9 closes the unseen-hub gap to zero: zorblax reaches 100% on paste full, paste compact, AND browse — same as raymindai
Cross-AI wedge confirmed without memorization advantage — 100% on synthetic content the AIs have never seen
Two independent axes: Paste vs Browse (how AI receives corpus) and Familiar vs Unseen (whether AI has seen hub during training)
Browse + Unseen is the actual real-world wedge measurement — 100% across Claude/OpenAI/Gemini
Round 9 fix: extractSkeleton now bullet-aware. Sections written as bullet lists no longer lose everything after the first bullet. Skeleton cap raised 380 → 700 chars.
Tool-use rate 100% on every runner across familiar AND unseen hubs — AIs reliably fetch URLs when handed them
Adversarial refusal 100% on raymindai — AIs refuse rather than fabricate when corpus lacks the answer
9 rounds, 8 production deploys, ~600 total bench cells across 4 measurement axes