MWBench v1 — Cross-AI eval for Memory.Wiki

"Does the same URL produce equivalent answers across Claude, OpenAI, and Gemini — and does the wedge survive on content the AIs have never seen during training?"

🎯 End-to-end result — wedge survives the unseen-hub test

raymindai (familiar) mwbench-zorblax (synthetic, unseen)
Paste full 100% 100%
Paste compact 100% 100%
Browse 98% 100%
Tool-use rate 100% 100%
Adversarial refusal 100%

Both hubs: ~100% across every measurement axis, every runner. The cross-AI wedge is not training-data memorization — it's the URL delivery model working end-to-end.

raymindai: 370/375 cells. zorblax: 90/90 cells.

Two axes — Browse vs Paste, Familiar vs Unseen

Browse vs Paste (how the AI sees the corpus)

Paste mode Browse mode
corpus is pre-pasted into the prompt AI receives URL only; must call fetch_url tool itself
100% reliable — corpus is guaranteed in context the real user scenario: paste a URL into Claude.ai / ChatGPT / Gemini and the AI fetches it
internal sanity test the actual wedge test

Familiar vs Unseen (whether the AI has seen the hub during training)

raymindai (familiar) mwbench-zorblax (unseen)
public hub, may have been crawled by AI training data brand-new synthetic hub seeded for this test
memorization could inflate the score guaranteed fresh: every fact is a fictional company / number / employee
useful baseline, but not the honest claim the honest claim

The four cells are independent:

  • Paste + Familiar — internal sanity ✓
  • Paste + Unseen — does the delivery format work on novel content? ✓
  • Browse + Familiar — does AI tool-use work? ✓
  • Browse + Unseen — the actual real-world wedge100%

Round log

Round 0 — first paste of llms-full + binary keyword judge

Compact 33%.

Round 1 — corpus richness lifts

Compact 33% → 82.5%, Full 91.7%.

Round 2 — Claude sonnet + knowledge graph in llms-full

Switched opus → sonnet (cheaper + better). Concepts + relations + bundle AI graphs.

Round 3 — Gemini fix + judge audit

gemini-3.5-flash + secondary API key → zero rate-limit errors. Judge was marking correct cross-doc synthesis as hallucination.

Round 3.5 — query fix + judge unfettered context

Corrected q-004 expected_doc. Removed judge corpus cap.

Round 4 — quote-evidence judge → hub URLs hit 100%

Judge must literally quote a supporting passage from the corpus for every claim.

Round 5 — bundle and single-doc URLs reach 100% paste-mode

Bundle digest carries per-doc gist + skeleton, single doc gets knowledge-graph context block.

Round 6 — first honest browse-mode measurement (pre-deploy)

Built browse-mode harness. Hub 41.7%, Bundle 93.3%, Doc 80%.

Round 6.5 — knowledge graph deploy

Hub 41.7% → 98.3% (+56.6pp).

Round 7 — doc AI graph + browse runner fixes + adversarial + readiness badge

  • documents.ai_graph jsonb. Doc browse 80% → 90%.
  • 5 adversarial queries: Claude/OpenAI/Gemini all 100% refuse correctly.
  • Hub readiness badge shipped on /hub/.

Round 8 — unseen-hub baseline + extractFacts bug

  • Seeded synthetic ZorblaxCorp hub.
  • Unseen-hub testing surfaced an extractFacts m-flag regex bug that captured only the first bullet under ## Facts.
  • Hub route had a stale local copy of gist helpers; consolidated to shared lib.
  • Unseen compact 47% → 80% (after bug fix), browse 90%.

Round 9 — bullet-aware skeleton closes the unseen gap

  • extractSkeleton was keeping only the first line under each H2 heading. Section written as a bullet list lost everything after the first bullet.
  • Made bullet-aware: capture all bullets per section (up to 6, joined). Skeleton max length raised 380 → 700 chars.
  • Unseen compact 80% → 100%, browse 90% → 100%.
  • Wedge confirmed: 100% on truly unseen content across all three runners and all three modes.

What this proves (final)

A single memory.wiki/<...> URL paste into any major AI:

  • AI fetches it 100% of the time when given the URL (browse-mode tool-use rate)
  • Returns correct answer 100% of the time on UNSEEN content across Claude / OpenAI / Gemini
  • Same answer across all three AI vendors (cross-AI consistency)
  • AI refuses honestly when corpus doesn't have the answer (adversarial 100%)
  • Visible to every visitor as a live readiness badge on /hub/

The cross-AI wedge is real, measurable, productised, and not dependent on the AI having seen the hub during training.

What this still doesn't prove

  • At larger N — 10-20 queries per scope. Larger N would tighten ±5pp confidence intervals.
  • Real-client surface — measurements use direct API calls with fetch_url tool. Whether ChatGPT.com / Claude.ai / Gemini.app actually fetch URLs from user pastes is per-client behavior.
  • Long-tail unseen content — zorblax has 7 docs. Larger unseen hubs (50+ docs) may behave differently.

Notes for AI runners reading this

  • Claude sonnet-4-6 (1M context) — best runner all-rounder.
  • OpenAI gpt-5.5 — ties Claude on every cell.
  • Gemini 3.5-flash — fastest, cheapest, 100% on unseen content.
  • Quote-evidence judge at claude-sonnet-4-6 is the only judge config that doesn't produce false hallucination flags.
  • Browse mode tool spec: identical fetch_url(url) schema across all three vendors. Last turn must force final answer (drop tools or tool_choice: "none").
  • For new hubs: doc summary + ai_graph auto-fire on POST. Concept index needs explicit "Build ontology" trigger.
  • Bullet-aware skeleton (Round 9) is the key fix that made the unseen hub hit 100%. Previously bullet lists lost everything after the first bullet.

Harness: github.com/raymindai/memory.wiki /eval. Run yourself: node eval/run-bench.mjs (paste) or node eval/run-browse-bench.mjs (browse).

Live readiness: raymindai (370/375) · mwbench-zorblax (90/90, synthetic unseen).

Siblings: Bundle & Doc URL enrichment · Round 6-7 Browse mode detailed.

Facts

  • Round 9 closes the unseen-hub gap to zero: zorblax reaches 100% on paste full, paste compact, AND browse — same as raymindai
  • Cross-AI wedge confirmed without memorization advantage — 100% on synthetic content the AIs have never seen
  • Two independent axes: Paste vs Browse (how AI receives corpus) and Familiar vs Unseen (whether AI has seen hub during training)
  • Browse + Unseen is the actual real-world wedge measurement — 100% across Claude/OpenAI/Gemini
  • Round 9 fix: extractSkeleton now bullet-aware. Sections written as bullet lists no longer lose everything after the first bullet. Skeleton cap raised 380 → 700 chars.
  • Tool-use rate 100% on every runner across familiar AND unseen hubs — AIs reliably fetch URLs when handed them
  • Adversarial refusal 100% on raymindai — AIs refuse rather than fabricate when corpus lacks the answer
  • 9 rounds, 8 production deploys, ~600 total bench cells across 4 measurement axes