MWBench Round 6-7 — Browse mode, the honest test

"Paste mode reached 100% across Claude / OpenAI / Gemini × hub / bundle / doc. But paste mode pre-feeds the corpus. The real wedge is: when a user pastes a Memory.Wiki URL into ChatGPT or Claude and the AI has to fetch it ITSELF, does the answer hold up?"

🎯 Round 7 result — browse-mode wedge holds

Each runner gets only the URL and the question. It must call a fetch_url tool, receive the markdown, then answer.

URL shape	Claude `sonnet-4-6`	OpenAI `gpt-5.5`	Gemini `3.5-flash`	Tool-use rate
Hub (`/hub/<slug>`)	100%	100%	95%	100%
Bundle (`/b/<id>`)	90%	100%	100%	100%
Doc (`/<id>`)	90%	90%	90%	100%

Tool-use rate is 100% on every runner. AIs reliably invoke fetch_url when handed a Memory.Wiki URL — they don't refuse, they don't try to answer from training data, they fetch the URL first.

What this proves

When a user pastes memory.wiki/hub/<slug>, memory.wiki/b/<id>, or memory.wiki/<id> into ChatGPT, Claude.ai, or Gemini, and the AI has tool use available, the AI fetches the URL 100% of the time and answers correctly 90-100% of the time across all three URL shapes.

This is the actual wedge. The earlier paste-mode 100% was a sanity check; the browse-mode 90-100% is the user-facing claim.

The round 6 → 7 journey

Round 6 (pre-deploy) — what production looked like before today's work

URL	Claude	OpenAI	Gemini
Hub	40%	50%	35%
Bundle	90%	100%	90%
Doc	80%	80%	80%

Hub crashed at 35-50% because the production llms-full.txt had only doc bodies — no concept_index, no concept_relations, no bundle AI graphs. When the AI fetched the hub URL, it got a sparse list with no navigation; it tried to follow links but couldn't reconstruct enough context.

Bundle held at 90-100% because production already had bundle graph_data (themes / insights / connections) — the analytical surface was already deployed.

Doc plateaued at 80% because /raw/<id> carried only frontmatter + body, with no per-doc analysis layer.

Round 6.5 — knowledge-graph deploy

Push 1 (commit 56a16794) added to llms-full.txt:

40 top concepts with weight + descriptions
30 concept-to-concept relations
Bundle AI graph summaries (themes / takeaways)

Effect:

Hub 41.7% → 98.3% (+56.6pp)
Bundle 93.3% → 96.7%
Doc unchanged at 80%

The hub jump was the single biggest delta of the entire bench arc. It indicates that for big-scope payloads, knowledge-graph navigation is the limiting factor, not body content.
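
The averages and the +56.6pp figure can be reproduced from the per-runner numbers in the Round 6 and Round 7 tables (the hub and bundle averages quoted for Round 6.5 match the Round 7 table). A minimal sketch; the function names are illustrative and not taken from the harness. Note that the delta is taken between the already-rounded averages; the unrounded difference is 56.67pp.

```javascript
// Per-runner accuracy (%) copied from the Round 6 and Round 7 tables,
// in the order Claude, OpenAI, Gemini.
const round6 = { hub: [40, 50, 35], bundle: [90, 100, 90], doc: [80, 80, 80] };
const round7 = { hub: [100, 100, 95], bundle: [90, 100, 100], doc: [90, 90, 90] };

// Mean of a list, rounded to one decimal place (the precision the report uses).
function meanPct(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return Math.round(mean * 10) / 10;
}

// Percentage-point delta between two already-rounded averages.
function deltaPp(before, after) {
  return Math.round((after - before) * 10) / 10;
}

// Build { shape: { before, after, deltaPp } } for every URL shape.
function summarize(before, after) {
  const out = {};
  for (const shape of Object.keys(before)) {
    const b = meanPct(before[shape]);
    const a = meanPct(after[shape]);
    out[shape] = { before: b, after: a, deltaPp: deltaPp(b, a) };
  }
  return out;
}

const summary = summarize(round6, round7);
console.log(summary);
```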

Round 7 — doc AI graph + runner final-turn salvage

Push 2 (commit 1e4f8dd2) addressed the doc-side gap and a tool-loop bug:

documents.ai_graph column (migration 047) — per-doc themes / insights / keyTakeaways / openQuestions. Same shape as bundles.graph_data, scoped to one doc. Generated by Claude Haiku (~$0.001/doc), fire-and-forget on doc POST/PATCH, backfilled across all 71 raymindai docs.

/raw/<id> route now surfaces ai_graph above the existing concept-context block. Summary → Themes → Key takeaways → Insights → Open questions. So when an AI fetches a single doc URL, it sees:


[body markdown]

---

## Summary
1-2 sentences

## Themes
- theme 1, theme 2, ...

## Key takeaways
- load-bearing claim 1
- load-bearing claim 2
- ...

## Insights
- non-obvious observation 1
- ...

## Open questions
- unresolved question 1
- ...

## Concepts in this document
- concept A (entity)
  description
- ...

## Concept relations
- A → B
- ...

## Bundles containing this document
- [bundle title](https://memory.wiki/b/<id>)
  description

_Hub canonical:_ ...
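
The section ordering above can be sketched as a small renderer. This is an illustration only: the field names themes, insights, keyTakeaways and openQuestions come from the column description, while the `summary` field and both function names are assumptions, not the actual /raw/<id> route code.

```javascript
// Hypothetical renderer for the per-doc ai_graph block.
// Assumed input shape: { summary, themes, keyTakeaways, insights, openQuestions }.
function renderAiGraph(aiGraph) {
  if (!aiGraph) return "";
  const sections = [];
  if (aiGraph.summary) sections.push(`## Summary\n${aiGraph.summary}`);
  if (aiGraph.themes?.length) {
    // Themes are shown on a single comma-separated line.
    sections.push(`## Themes\n- ${aiGraph.themes.join(", ")}`);
  }
  // Remaining sections are one bullet per item, in the documented order.
  const bulletSections = [
    ["Key takeaways", aiGraph.keyTakeaways],
    ["Insights", aiGraph.insights],
    ["Open questions", aiGraph.openQuestions],
  ];
  for (const [title, items] of bulletSections) {
    if (items?.length) {
      sections.push(`## ${title}\n${items.map((item) => `- ${item}`).join("\n")}`);
    }
  }
  return sections.join("\n\n");
}

// Body first, then a rule, then the analysis layer. Docs without an
// ai_graph are returned unchanged.
function renderRawDoc(body, aiGraph) {
  const graph = renderAiGraph(aiGraph);
  return graph ? `${body}\n\n---\n\n${graph}` : body;
}
```

Skipping empty sections keeps a doc whose graph has not been generated yet (the generation is fire-and-forget) readable as plain body markdown.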

The browse runners had a tool-loop bug: when the MAX_TOOL_TURNS=4 limit was hit, OpenAI and Gemini returned empty answers. The fix raised the limit to 6 turns and, on the last turn, either dropped the tools declaration (Anthropic / Gemini) or set tool_choice: "none" (OpenAI) to force a final text answer. Doc browse went from 80% to 90% on every runner.
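
A vendor-neutral sketch of that loop, assuming two injected adapters that are hypothetical here: `callModel({ messages, allowTools })` returns either `{ text }` or `{ toolCall: { url } }`, and `fetchUrl(url)` returns markdown. The real runners may be structured differently; the point is only that tools are withheld on the final turn so the loop cannot end without text.

```javascript
const MAX_TOOL_TURNS = 6;

// Hypothetical browse loop. callModel and fetchUrl are injected so the
// same control flow can wrap any vendor adapter.
async function runBrowseLoop({ url, question, callModel, fetchUrl, maxTurns = MAX_TOOL_TURNS }) {
  const messages = [{ role: "user", content: `${url}\n\n${question}` }];
  let toolCalls = 0;

  for (let turn = 1; turn <= maxTurns; turn++) {
    const isLastTurn = turn === maxTurns;
    // On the last turn the model is not offered tools, forcing a text answer.
    const reply = await callModel({ messages, allowTools: !isLastTurn });

    if (reply.toolCall && !isLastTurn) {
      const markdown = await fetchUrl(reply.toolCall.url);
      toolCalls += 1;
      messages.push({ role: "assistant", toolCall: reply.toolCall });
      messages.push({ role: "tool", content: markdown });
      continue;
    }
    return { answer: reply.text ?? "", toolCalls, turns: turn };
  }
  // Only reached when maxTurns < 1.
  return { answer: "", toolCalls, turns: 0 };
}
```

With a stub model that requests a fetch whenever tools are offered, the loop makes five fetches and returns text on the sixth turn rather than coming back empty.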

Push 3 (commit 40747ef8) cleaned up two follow-on bugs:

OpenAI tool_choice: "none" requires tools to still be present
Gemini last-turn needs thinkingBudget: 0 so the small output budget isn't consumed by planning
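
Taken together, the last-turn rules for the three vendors can be sketched as one request-shaping helper. The helper name is hypothetical, and the request bodies are assumed to follow each vendor's public REST shape (`tools` / `tool_choice` for Anthropic and OpenAI, `tools` / `generationConfig.thinkingConfig.thinkingBudget` for Gemini); the harness's own code may differ.

```javascript
// Hypothetical helper: returns a copy of a vendor request body adjusted
// for the final allowed turn. The input object is never mutated.
function forceFinalAnswer(vendor, request) {
  switch (vendor) {
    case "anthropic": {
      // Drop the tools declaration so only text can come back.
      const { tools, tool_choice, ...rest } = request;
      return rest;
    }
    case "openai":
      // Keep `tools` present; tool_choice "none" forbids calling them.
      return { ...request, tool_choice: "none" };
    case "gemini": {
      // Drop tools and zero the thinking budget so the small output
      // budget is spent on the answer, not on planning more fetches.
      const { tools, toolConfig, ...rest } = request;
      return {
        ...rest,
        generationConfig: {
          ...rest.generationConfig,
          thinkingConfig: { ...rest.generationConfig?.thinkingConfig, thinkingBudget: 0 },
        },
      };
    }
    default:
      throw new Error(`unknown vendor: ${vendor}`);
  }
}
```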

What gates the residual gap

5 calls out of 120 still fail. Each has a different cause:

Hub Gemini q-019 — judge over-strictness on a v7-vs-v7-revised differentiation question. Real answer was substantively correct.
Bundle Claude q-002, "What are mdfy's three URL primitives?": the answer was correct, but Claude phrased one URL shape ambiguously. Near miss.
Doc qd-005 × 3, "How long does it take to install /mdfy?": the doc literally states "30 seconds." The failure mode differs per runner:
- Claude says "no mention" (hallucinated negation despite content being there)
- OpenAI says "under 30 seconds" (semantic over-precision — judge flagged as contradiction)
- Gemini returns empty (last-turn salvage still misfires on this specific tool-loop)

The qd-005 hard query is a useful adversarial datapoint: a literal one-line answer that 3 different runners flub in 3 different ways suggests the failure is runner-specific, not corpus-side.
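
As a cross-check, the five listed failures reproduce the Round 7 table exactly if the query sets hold 20 hub, 10 bundle and 10 doc questions. Those counts are an assumption: the report does not state them, but they are consistent with the 5%-step hub score, the 10%-step bundle and doc scores, and the stated total of 120 calls.

```javascript
// ASSUMED query counts per URL shape (not stated in the report).
const QUERIES = { hub: 20, bundle: 10, doc: 10 };
const RUNNERS = ["claude", "openai", "gemini"];

// Residual failures as listed in the report.
const failures = [
  { shape: "hub", runner: "gemini", query: "q-019" },
  { shape: "bundle", runner: "claude", query: "q-002" },
  { shape: "doc", runner: "claude", query: "qd-005" },
  { shape: "doc", runner: "openai", query: "qd-005" },
  { shape: "doc", runner: "gemini", query: "qd-005" },
];

// Accuracy (%) per shape and runner, derived from the failure list.
function accuracyTable(queries, runners, failed) {
  const table = {};
  for (const [shape, n] of Object.entries(queries)) {
    table[shape] = {};
    for (const runner of runners) {
      const misses = failed.filter((f) => f.shape === shape && f.runner === runner).length;
      table[shape][runner] = Math.round(((n - misses) / n) * 100);
    }
  }
  return table;
}

const totalCalls = Object.values(QUERIES).reduce((sum, n) => sum + n, 0) * RUNNERS.length;
const table = accuracyTable(QUERIES, RUNNERS, failures);
console.log(totalCalls, table);
```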

Token economy in browse mode

Tools-mode adds overhead — each turn passes accumulated context. Numbers per query (avg):

Runner	Input tokens (hub)	Output tokens (hub)	Tool calls (hub)	Latency (hub)
Claude	~78k	~280	2.0	~13s
OpenAI	~92k	~330	2.4	~12s
Gemini	~78k	~180	2.0	~9s

Single-doc browse is much cheaper — ~5-8k input, 1-2 tool calls, 3-5s. The cost scales with how much navigation the AI does inside the hub.
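
Why navigation drives cost: every model turn re-sends the prompt plus all tool results fetched so far, so input tokens add up over a growing context. A rough sketch under simplifying assumptions (no prompt caching, model output tokens ignored); the token sizes below are hypothetical, not measurements from the bench.

```javascript
// Total input tokens across a tool loop where each turn re-sends the
// prompt plus every tool result fetched so far.
function totalInputTokens(promptTokens, toolResultTokens) {
  let context = promptTokens;
  let total = context; // turn 1: prompt only
  for (const resultTokens of toolResultTokens) {
    context += resultTokens; // the tool result is appended to the context
    total += context;        // the next turn re-sends everything so far
  }
  return total;
}

// Hypothetical sizes: a large hub payload plus one follow-up doc,
// versus a single doc fetch.
const hubQuery = totalInputTokens(500, [40000, 18000]);
const docQuery = totalInputTokens(500, [5000]);
console.log({ hubQuery, docQuery });
```

In this model the first fetched payload is counted once per remaining turn, which is why a large hub payload dominates the total.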

Methodology change for browse mode

Browse mode uses the production https://memory.wiki/<...> URLs, not localhost. This means:

Vendor servers (Anthropic / OpenAI / Google) fetch from the public internet
Every deploy of the Memory.Wiki app changes browse-mode numbers
The bench can't measure unpushed local changes — push first, then measure

This is actually the correct test surface: real-world AI clients see exactly what production serves. The deploy chain IS part of the wedge.
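
A small guard can make the "production only" rule explicit when building bench URLs. This is a sketch: the three URL shapes come from the results table, while the function name and the localhost check are assumptions rather than part of the harness.

```javascript
const PRODUCTION_ORIGIN = "https://memory.wiki";

// Build a browse-mode URL for one of the three shapes:
// /hub/<slug>, /b/<id>, /<id>. Refuses local origins, since browse mode
// is meant to measure what the deployed site serves.
function browseUrl(shape, id, origin = PRODUCTION_ORIGIN) {
  const host = new URL(origin).hostname;
  if (host === "localhost" || host === "127.0.0.1") {
    throw new Error("browse mode must target the deployed site, not localhost");
  }
  const segment = encodeURIComponent(id);
  switch (shape) {
    case "hub":
      return `${origin}/hub/${segment}`;
    case "bundle":
      return `${origin}/b/${segment}`;
    case "doc":
      return `${origin}/${segment}`;
    default:
      throw new Error(`unknown URL shape: ${shape}`);
  }
}
```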

What the bench arc proves end-to-end

Round 0 (paste, no enrichment) → Round 7 (browse, full enrichment):

Hub: 33% → 98.3% (Round 7 honest browse)
Bundle: untested → 96.7%
Doc: untested → 90%
Cross-AI: Claude/OpenAI/Gemini all reach 90-100% in both modes

The path:

Corpus enrichment (Facts blocks, summaries, skeletons, all-docs catalog) gave compact mode parity with full mode
Knowledge graph deployment (concept_index + concept_relations + bundle AI graphs in llms-full) unlocked browse-mode hub URLs
Per-doc AI graph (themes / takeaways / insights / open questions in /raw/<id>) closed the doc-URL browse gap
Quote-evidence judge kept the measurement honest — no false hallucination flags, every claim audited against the actual corpus

Six judge iterations, three production deploys, four bench rounds. Real numbers on real URLs.

Notes for AI runners reading this

Browse mode requires native tool/function use on the model. All three runners (claude-sonnet-4-6, gpt-5.5, gemini-3.5-flash) support it out of the box.
fetch_url tool spec is identical across vendors — single JSON schema, three protocol shapes (Anthropic tools, OpenAI tools[].function, Gemini functionDeclarations).
On the last allowed turn, force a final text answer — drop the tool declaration (Anthropic/Gemini) or set tool_choice: "none" while keeping tools present (OpenAI). Without this, runners loop on tool calls until exhausted and return nothing.
Gemini last-turn: set thinkingBudget: 0 so the small output budget isn't consumed planning more fetches that can't happen.
6 tool turns is enough for hub → doc chain: AI fetches hub, picks 1-2 docs, fetches them, answers. Anything more usually indicates a content gap rather than a navigation gap.
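
The "single JSON schema, three protocol shapes" note can be illustrated as follows. Only the tool name fetch_url comes from the bench; the description text, the `url` parameter name and the converter functions are assumptions, and the wrappers follow each vendor's public tool format as best understood here (`input_schema` for Anthropic, a `function` wrapper for OpenAI, `functionDeclarations` for Gemini).

```javascript
// One shared spec. The description and parameter name are illustrative.
const FETCH_URL = {
  name: "fetch_url",
  description: "Fetch a URL and return its content as markdown.",
  parameters: {
    type: "object",
    properties: {
      url: { type: "string", description: "Absolute URL to fetch" },
    },
    required: ["url"],
  },
};

// Anthropic: tools[] entries carry the schema under `input_schema`.
const toAnthropic = (t) => ({
  name: t.name,
  description: t.description,
  input_schema: t.parameters,
});

// OpenAI: tools[] entries wrap the spec in a `function` object.
const toOpenAI = (t) => ({
  type: "function",
  function: { name: t.name, description: t.description, parameters: t.parameters },
});

// Gemini: tools[] entries group specs under `functionDeclarations`.
const toGemini = (t) => ({
  functionDeclarations: [
    { name: t.name, description: t.description, parameters: t.parameters },
  ],
});

console.log(JSON.stringify([toAnthropic(FETCH_URL), toOpenAI(FETCH_URL), toGemini(FETCH_URL)], null, 2));
```

Keeping one source spec and deriving the vendor shapes means a schema change cannot drift between runners.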

Harness: github.com/raymindai/memory.wiki, under /eval. Browse bench: node eval/run-browse-bench.mjs --queries=queries/<hub|bundles|docs>-v1.jsonl.

Siblings: MWBench v1 (paste mode) · Bundle & Doc URL enrichment.

Facts

Round 7 browse-mode accuracy: Hub 98.3% / Bundle 96.7% / Doc 90% averaged across Claude/OpenAI/Gemini
Tool-use rate is 100% on every runner — AIs reliably fetch Memory.Wiki URLs when handed them
Hub URL browse jumped 41.7% → 98.3% after deploying knowledge graph (concept_index + concept_relations + bundle AI graphs) to llms-full.txt
Doc URL browse jumped 80% → 90% after adding documents.ai_graph column (themes / insights / takeaways / openQuestions) and surfacing it in /raw/
Browse runners need a "last turn force final answer" pattern: without it, OpenAI/Gemini hit the tool loop and return empty
Browse mode hits the production deploy chain: vendor servers fetch from the public internet, so every Memory.Wiki deploy changes browse-mode numbers
5/120 residual failures: 1 judge-over-strict, 1 phrasing ambiguity, 3 different runner failures on the same hard query (qd-005 "how long to install /mdfy")
The complete bench arc (Round 0 → 7): compact 33% → 100% paste / 90-98% browse across all 3 URL shapes and 3 AI vendors