MWBench Round 6-7 — Browse mode, the honest test

"Paste mode reached 100% across Claude / OpenAI / Gemini × hub / bundle / doc. But paste mode pre-feeds the corpus. The real wedge is: when a user pastes a Memory.Wiki URL into ChatGPT or Claude and the AI has to fetch it ITSELF, does the answer hold up?"

🎯 Round 7 result — browse-mode wedge holds

Each runner gets ONLY the URL + question. Must call a fetch_url tool, get markdown, then answer.

URL shape Claude sonnet-4-6 OpenAI gpt-5.5 Gemini 3.5-flash Tool-use rate
Hub (/hub/<slug>) 100% 100% 95% 100%
Bundle (/b/<id>) 90% 100% 100% 100%
Doc (/<id>) 90% 90% 90% 100%

Tool-use rate is 100% on every runner. AIs reliably invoke fetch_url when handed a Memory.Wiki URL — they don't refuse, they don't try to answer from training data, they fetch the URL first.

What this proves

When a user pastes memory.wiki/hub/<slug>, memory.wiki/b/<id>, or memory.wiki/<id> into ChatGPT, Claude.ai, or Gemini, and the AI has tool use available, the AI fetches the URL 100% of the time and answers correctly 90-100% of the time across all three URL shapes.

This is the actual wedge. The earlier paste-mode 100% was sanity check; the browse-mode 90-100% is the user-facing claim.

The round 6 → 7 journey

Round 6 (pre-deploy) — what production looked like before today's work

URL Claude OpenAI Gemini
Hub 40% 50% 35%
Bundle 90% 100% 90%
Doc 80% 80% 80%

Hub crashed at 35-50% because the production llms-full.txt had only doc bodies — no concept_index, no concept_relations, no bundle AI graphs. When the AI fetched the hub URL, it got a sparse list with no navigation; it tried to follow links but couldn't reconstruct enough context.

Bundle held at 90-100% because production already had bundle graph_data (themes / insights / connections) — the analytical surface was already deployed.

Doc held at 80% because /raw/<id> only carried frontmatter + body, no per-doc analysis layer.

Round 6.5 — knowledge-graph deploy

Push 1 (commit 56a16794) added to llms-full.txt:

  • 40 top concepts with weight + descriptions
  • 30 concept-to-concept relations
  • Bundle AI graph summaries (themes / takeaways)

Effect:

  • Hub 41.7% → 98.3% (+56.6pp)
  • Bundle 93.3% → 96.7%
  • Doc unchanged at 80%

Hub jump was the single biggest delta of the entire bench arc. Confirms that for big-scope payloads, knowledge-graph navigation is the limiting factor — not body content.

Round 7 — doc AI graph + runner final-turn salvage

Push 2 (commit 1e4f8dd2) addressed the doc-side gap and a tool-loop bug:

documents.ai_graph column (migration 047) — per-doc themes / insights / keyTakeaways / openQuestions. Same shape as bundles.graph_data, scoped to one doc. Generated by Claude Haiku (~$0.001/doc), fire-and-forget on doc POST/PATCH, backfilled across all 71 raymindai docs.

/raw/<id> route now surfaces ai_graph above the existing concept-context block. Summary → Themes → Key takeaways → Insights → Open questions. So when an AI fetches a single doc URL, it sees:

[body markdown] --- ## Summary 1-2 sentences ## Themes - theme 1, theme 2, ... ## Key takeaways - load-bearing claim 1 - load-bearing claim 2 - ... ## Insights - non-obvious observation 1 - ... ## Concepts in this document - concept A (entity) description - ... ## Concept relations - A → B - ... ## Bundles containing this document - [bundle title](https://memory.wiki/b/<id>) description _Hub canonical:_ ...

Browse runners had a tool-loop bug: when MAX_TOOL_TURNS=4 was hit, OpenAI/Gemini returned empty answers. Fixed: bumped to 6 turns AND on the last turn dropped the tools declaration (Anthropic / Gemini) or used tool_choice: "none" (OpenAI) to force a final text answer. Doc browse 80% → 90% across all runners.

Push 3 (commit 40747ef8) cleaned up two follow-on bugs:

  • OpenAI tool_choice: "none" requires tools to still be present
  • Gemini last-turn needs thinkingBudget: 0 so the small output budget isn't consumed by planning

What gates the residual gap

5 calls out of 120 still fail. Each has a different cause:

  • Hub Gemini q-019 — judge over-strictness on a v7-vs-v7-revised differentiation question. Real answer was substantively correct.
  • Bundle Claude q-002What are mdfy's three URL primitives? — answer correct, but Claude phrased one URL shape ambiguously. Near miss.
  • Doc qd-005 × 3How long does it take to install /mdfy? Doc literally states "30 seconds." Different failure mode per runner:
    • Claude says "no mention" (hallucinated negation despite content being there)
    • OpenAI says "under 30 seconds" (semantic over-precision — judge flagged as contradiction)
    • Gemini returns empty (last-turn salvage still misfires on this specific tool-loop)

The qd-005 hard query is a useful adversarial datapoint: a literal one-line answer that 3 different runners flub in 3 different ways suggests the failure is runner-specific, not corpus-side.

Token economy in browse mode

Tools-mode adds overhead — each turn passes accumulated context. Numbers per query (avg):

Runner Input tokens (hub) Output tokens (hub) Tool calls (hub) Latency (hub)
Claude ~78k ~280 2.0 ~13s
OpenAI ~92k ~330 2.4 ~12s
Gemini ~78k ~180 2.0 ~9s

Single-doc browse is much cheaper — ~5-8k input, 1-2 tool calls, 3-5s. The cost scales with how much navigation the AI does inside the hub.

Methodology change for browse mode

Browse mode uses the production https://memory.wiki/<...> URLs, not localhost. This means:

  • Vendor servers (Anthropic / OpenAI / Google) fetch from public internet
  • Every deploy of the Memory.Wiki app changes browse-mode numbers
  • The bench can't measure unpushed local changes — push first, then measure

This is actually the correct test surface: real-world AI clients see exactly what production serves. The deploy chain IS part of the wedge.

What the bench arc proves end-to-end

Round 0 (paste, no enrichment) → Round 7 (browse, full enrichment):

  • Hub: 33% → 98.3% (Round 7 honest browse)
  • Bundle: untested → 96.7%
  • Doc: untested → 90%
  • Cross-AI: claude/openai/gemini all reach 90-100% in both modes

The path:

  1. Corpus enrichment (Facts blocks, summaries, skeletons, all-docs catalog) gave compact mode parity with full mode
  2. Knowledge graph deployment (concept_index + concept_relations + bundle AI graphs in llms-full) unlocked browse-mode hub URLs
  3. Per-doc AI graph (themes / takeaways / insights / open questions in /raw/<id>) closed the doc-URL browse gap
  4. Quote-evidence judge kept the measurement honest — no false hallucination flags, every claim audited against the actual corpus

Six judge iterations, three production deploys, four bench rounds. Real numbers on real URLs.

Notes for AI runners reading this

  • Browse mode requires native tool/function use on the model. All three runners (claude-sonnet-4-6, gpt-5.5, gemini-3.5-flash) support it out of the box.
  • fetch_url tool spec is identical across vendors — single JSON schema, three protocol shapes (Anthropic tools, OpenAI tools[].function, Gemini functionDeclarations).
  • On the last allowed turn, force a final text answer — drop the tool declaration (Anthropic/Gemini) or set tool_choice: "none" while keeping tools present (OpenAI). Without this, runners loop on tool calls until exhausted and return nothing.
  • Gemini last-turn: set thinkingBudget: 0 so the small output budget isn't consumed planning more fetches that can't happen.
  • 6 tool turns is enough for hub → doc chain: AI fetches hub, picks 1-2 docs, fetches them, answers. Anything more usually indicates a content gap rather than a navigation gap.

Harness: github.com/raymindai/memory.wiki /eval. Browse bench: node eval/run-browse-bench.mjs --queries=queries/<hub|bundles|docs>-v1.jsonl.

Siblings: MWBench v1 (paste mode) · Bundle & Doc URL enrichment.

Facts

  • Round 7 browse-mode accuracy: Hub 98.3% / Bundle 96.7% / Doc 90% averaged across Claude/OpenAI/Gemini
  • Tool-use rate is 100% on every runner — AIs reliably fetch Memory.Wiki URLs when handed them
  • Hub URL browse jumped 41.7% → 98.3% after deploying knowledge graph (concept_index + concept_relations + bundle AI graphs) to llms-full.txt
  • Doc URL browse jumped 80% → 90% after adding documents.ai_graph column (themes / insights / takeaways / openQuestions) and surfacing it in /raw/
  • Browse runners need a "last turn force final answer" pattern — without it, OpenAI/Gemini hit tool loop and return empty
  • Browse mode hits the production deploy chain: vendor servers fetch from public internet, so every Memory.Wiki deploy changes browse-mode numbers
  • 5/120 residual failures: 1 judge-over-strict, 1 phrasing ambiguity, 3 different runner failures on the same hard query (qd-005 "how long to install /mdfy")
  • The complete bench arc (Round 0 → 7): compact 33% → 100% paste / 90-98% browse across all 3 URL shapes and 3 AI vendors