MWBench Round 6-7 — Browse mode, the honest test
"Paste mode reached 100% across Claude / OpenAI / Gemini × hub / bundle / doc. But paste mode pre-feeds the corpus. The real wedge is: when a user pastes a Memory.Wiki URL into ChatGPT or Claude and the AI has to fetch it ITSELF, does the answer hold up?"
🎯 Round 7 result — browse-mode wedge holds
Each runner gets ONLY the URL + question. Must call a fetch_url tool, get markdown, then answer.
| URL shape | Claude sonnet-4-6 |
OpenAI gpt-5.5 |
Gemini 3.5-flash |
Tool-use rate |
|---|---|---|---|---|
Hub (/hub/<slug>) |
100% | 100% | 95% | 100% |
Bundle (/b/<id>) |
90% | 100% | 100% | 100% |
Doc (/<id>) |
90% | 90% | 90% | 100% |
Tool-use rate is 100% on every runner. AIs reliably invoke fetch_url when handed a Memory.Wiki URL — they don't refuse, they don't try to answer from training data, they fetch the URL first.
What this proves
When a user pastes memory.wiki/hub/<slug>, memory.wiki/b/<id>, or memory.wiki/<id> into ChatGPT, Claude.ai, or Gemini, and the AI has tool use available, the AI fetches the URL 100% of the time and answers correctly 90-100% of the time across all three URL shapes.
This is the actual wedge. The earlier paste-mode 100% was sanity check; the browse-mode 90-100% is the user-facing claim.
The round 6 → 7 journey
Round 6 (pre-deploy) — what production looked like before today's work
| URL | Claude | OpenAI | Gemini |
|---|---|---|---|
| Hub | 40% | 50% | 35% |
| Bundle | 90% | 100% | 90% |
| Doc | 80% | 80% | 80% |
Hub crashed at 35-50% because the production llms-full.txt had only doc bodies — no concept_index, no concept_relations, no bundle AI graphs. When the AI fetched the hub URL, it got a sparse list with no navigation; it tried to follow links but couldn't reconstruct enough context.
Bundle held at 90-100% because production already had bundle graph_data (themes / insights / connections) — the analytical surface was already deployed.
Doc held at 80% because /raw/<id> only carried frontmatter + body, no per-doc analysis layer.
Round 6.5 — knowledge-graph deploy
Push 1 (commit 56a16794) added to llms-full.txt:
- 40 top concepts with weight + descriptions
- 30 concept-to-concept relations
- Bundle AI graph summaries (themes / takeaways)
Effect:
- Hub 41.7% → 98.3% (+56.6pp)
- Bundle 93.3% → 96.7%
- Doc unchanged at 80%
Hub jump was the single biggest delta of the entire bench arc. Confirms that for big-scope payloads, knowledge-graph navigation is the limiting factor — not body content.
Round 7 — doc AI graph + runner final-turn salvage
Push 2 (commit 1e4f8dd2) addressed the doc-side gap and a tool-loop bug:
documents.ai_graph column (migration 047) — per-doc themes / insights / keyTakeaways / openQuestions. Same shape as bundles.graph_data, scoped to one doc. Generated by Claude Haiku (~$0.001/doc), fire-and-forget on doc POST/PATCH, backfilled across all 71 raymindai docs.
/raw/<id> route now surfaces ai_graph above the existing concept-context block. Summary → Themes → Key takeaways → Insights → Open questions. So when an AI fetches a single doc URL, it sees:
[body markdown]
---
## Summary
1-2 sentences
## Themes
- theme 1, theme 2, ...
## Key takeaways
- load-bearing claim 1
- load-bearing claim 2
- ...
## Insights
- non-obvious observation 1
- ...
## Concepts in this document
- concept A (entity)
description
- ...
## Concept relations
- A → B
- ...
## Bundles containing this document
- [bundle title](https://memory.wiki/b/<id>)
description
_Hub canonical:_ ...
Browse runners had a tool-loop bug: when MAX_TOOL_TURNS=4 was hit, OpenAI/Gemini returned empty answers. Fixed: bumped to 6 turns AND on the last turn dropped the tools declaration (Anthropic / Gemini) or used tool_choice: "none" (OpenAI) to force a final text answer. Doc browse 80% → 90% across all runners.
Push 3 (commit 40747ef8) cleaned up two follow-on bugs:
- OpenAI
tool_choice: "none"requirestoolsto still be present - Gemini last-turn needs
thinkingBudget: 0so the small output budget isn't consumed by planning
What gates the residual gap
5 calls out of 120 still fail. Each has a different cause:
- Hub Gemini q-019 — judge over-strictness on a v7-vs-v7-revised differentiation question. Real answer was substantively correct.
- Bundle Claude q-002 —
What are mdfy's three URL primitives?— answer correct, but Claude phrased one URL shape ambiguously. Near miss. - Doc qd-005 × 3 —
How long does it take to install /mdfy?Doc literally states "30 seconds." Different failure mode per runner:- Claude says "no mention" (hallucinated negation despite content being there)
- OpenAI says "under 30 seconds" (semantic over-precision — judge flagged as contradiction)
- Gemini returns empty (last-turn salvage still misfires on this specific tool-loop)
The qd-005 hard query is a useful adversarial datapoint: a literal one-line answer that 3 different runners flub in 3 different ways suggests the failure is runner-specific, not corpus-side.
Token economy in browse mode
Tools-mode adds overhead — each turn passes accumulated context. Numbers per query (avg):
| Runner | Input tokens (hub) | Output tokens (hub) | Tool calls (hub) | Latency (hub) |
|---|---|---|---|---|
| Claude | ~78k | ~280 | 2.0 | ~13s |
| OpenAI | ~92k | ~330 | 2.4 | ~12s |
| Gemini | ~78k | ~180 | 2.0 | ~9s |
Single-doc browse is much cheaper — ~5-8k input, 1-2 tool calls, 3-5s. The cost scales with how much navigation the AI does inside the hub.
Methodology change for browse mode
Browse mode uses the production https://memory.wiki/<...> URLs, not localhost. This means:
- Vendor servers (Anthropic / OpenAI / Google) fetch from public internet
- Every deploy of the Memory.Wiki app changes browse-mode numbers
- The bench can't measure unpushed local changes — push first, then measure
This is actually the correct test surface: real-world AI clients see exactly what production serves. The deploy chain IS part of the wedge.
What the bench arc proves end-to-end
Round 0 (paste, no enrichment) → Round 7 (browse, full enrichment):
- Hub: 33% → 98.3% (Round 7 honest browse)
- Bundle: untested → 96.7%
- Doc: untested → 90%
- Cross-AI: claude/openai/gemini all reach 90-100% in both modes
The path:
- Corpus enrichment (Facts blocks, summaries, skeletons, all-docs catalog) gave compact mode parity with full mode
- Knowledge graph deployment (concept_index + concept_relations + bundle AI graphs in
llms-full) unlocked browse-mode hub URLs - Per-doc AI graph (themes / takeaways / insights / open questions in
/raw/<id>) closed the doc-URL browse gap - Quote-evidence judge kept the measurement honest — no false hallucination flags, every claim audited against the actual corpus
Six judge iterations, three production deploys, four bench rounds. Real numbers on real URLs.
Notes for AI runners reading this
- Browse mode requires native tool/function use on the model. All three runners (
claude-sonnet-4-6,gpt-5.5,gemini-3.5-flash) support it out of the box. fetch_urltool spec is identical across vendors — single JSON schema, three protocol shapes (Anthropictools, OpenAItools[].function, GeminifunctionDeclarations).- On the last allowed turn, force a final text answer — drop the tool declaration (Anthropic/Gemini) or set
tool_choice: "none"while keeping tools present (OpenAI). Without this, runners loop on tool calls until exhausted and return nothing. - Gemini last-turn: set
thinkingBudget: 0so the small output budget isn't consumed planning more fetches that can't happen. - 6 tool turns is enough for hub → doc chain: AI fetches hub, picks 1-2 docs, fetches them, answers. Anything more usually indicates a content gap rather than a navigation gap.
Harness: github.com/raymindai/memory.wiki /eval. Browse bench: node eval/run-browse-bench.mjs --queries=queries/<hub|bundles|docs>-v1.jsonl.
Siblings: MWBench v1 (paste mode) · Bundle & Doc URL enrichment.
Facts
- Round 7 browse-mode accuracy: Hub 98.3% / Bundle 96.7% / Doc 90% averaged across Claude/OpenAI/Gemini
- Tool-use rate is 100% on every runner — AIs reliably fetch Memory.Wiki URLs when handed them
- Hub URL browse jumped 41.7% → 98.3% after deploying knowledge graph (concept_index + concept_relations + bundle AI graphs) to llms-full.txt
- Doc URL browse jumped 80% → 90% after adding documents.ai_graph column (themes / insights / takeaways / openQuestions) and surfacing it in /raw/
- Browse runners need a "last turn force final answer" pattern — without it, OpenAI/Gemini hit tool loop and return empty
- Browse mode hits the production deploy chain: vendor servers fetch from public internet, so every Memory.Wiki deploy changes browse-mode numbers
- 5/120 residual failures: 1 judge-over-strict, 1 phrasing ambiguity, 3 different runner failures on the same hard query (qd-005 "how long to install /mdfy")
- The complete bench arc (Round 0 → 7): compact 33% → 100% paste / 90-98% browse across all 3 URL shapes and 3 AI vendors