---
title: "MWBench Round 6-7 — Browse mode, the honest test"
url: https://memory.wiki/D-TSWhl4
updated: 2026-05-24T10:49:23.634Z
hub: https://memory.wiki/hub/raymindai
concept_count: 12
source: "mcp"
---
---
captured: 2026-05-24
sibling: MWBench v1 (https://mdfy.app/gzuNdh_P), Bundle/Doc enrichment (https://mdfy.app/yGk04Hee)
---

# MWBench Round 6-7 — Browse mode, the honest test

> "Paste mode reached 100% across Claude / OpenAI / Gemini × hub / bundle / doc. But paste mode pre-feeds the corpus. The real wedge is: when a user pastes a Memory.Wiki URL into ChatGPT or Claude and the AI has to fetch it ITSELF, does the answer hold up?"

## 🎯 Round 7 result — browse-mode wedge holds

Each runner gets ONLY the URL + question. Must call a `fetch_url` tool, get markdown, then answer.

| URL shape | Claude `sonnet-4-6` | OpenAI `gpt-5.5` | Gemini `3.5-flash` | Tool-use rate |
|---|---|---|---|---|
| **Hub** (`/hub/<slug>`) | **100%** | **100%** | 95% | 100% |
| **Bundle** (`/b/<id>`) | 90% | **100%** | **100%** | 100% |
| **Doc** (`/<id>`) | 90% | 90% | 90% | 100% |

**Tool-use rate is 100% on every runner.** AIs reliably invoke `fetch_url` when handed a Memory.Wiki URL — they don't refuse, they don't try to answer from training data, they fetch the URL first.

## What this proves

When a user pastes `memory.wiki/hub/<slug>`, `memory.wiki/b/<id>`, or `memory.wiki/<id>` into ChatGPT, Claude.ai, or Gemini, and the AI has tool use available, **the AI fetches the URL 100% of the time and answers correctly 90-100% of the time** across all three URL shapes.

This is the actual wedge. The earlier paste-mode 100% was sanity check; the browse-mode 90-100% is the user-facing claim.

## The round 6 → 7 journey

### Round 6 (pre-deploy) — what production looked like before today's work

| URL | Claude | OpenAI | Gemini |
|---|---|---|---|
| Hub | 40% | 50% | 35% |
| Bundle | 90% | 100% | 90% |
| Doc | 80% | 80% | 80% |

Hub crashed at 35-50% because the production `llms-full.txt` had only doc bodies — no `concept_index`, no `concept_relations`, no bundle AI graphs. When the AI fetched the hub URL, it got a sparse list with no navigation; it tried to follow links but couldn't reconstruct enough context.

Bundle held at 90-100% because production already had bundle `graph_data` (themes / insights / connections) — the analytical surface was already deployed.

Doc held at 80% because `/raw/<id>` only carried frontmatter + body, no per-doc analysis layer.

### Round 6.5 — knowledge-graph deploy

Push 1 (commit `56a16794`) added to `llms-full.txt`:
- 40 top concepts with weight + descriptions
- 30 concept-to-concept relations
- Bundle AI graph summaries (themes / takeaways)

Effect:
- **Hub 41.7% → 98.3% (+56.6pp)**
- Bundle 93.3% → 96.7%
- Doc unchanged at 80%

Hub jump was the single biggest delta of the entire bench arc. Confirms that for big-scope payloads, knowledge-graph navigation is the limiting factor — not body content.

### Round 7 — doc AI graph + runner final-turn salvage

Push 2 (commit `1e4f8dd2`) addressed the doc-side gap and a tool-loop bug:

**`documents.ai_graph` column** (migration 047) — per-doc themes / insights / keyTakeaways / openQuestions. Same shape as `bundles.graph_data`, scoped to one doc. Generated by Claude Haiku (~$0.001/doc), fire-and-forget on doc POST/PATCH, backfilled across all 71 raymindai docs.

**`/raw/<id>` route** now surfaces ai_graph above the existing concept-context block. Summary → Themes → Key takeaways → Insights → Open questions. So when an AI fetches a single doc URL, it sees:

```
[body markdown]

---

## Summary
1-2 sentences

## Themes
- theme 1, theme 2, ...

## Key takeaways
- load-bearing claim 1
- load-bearing claim 2
- ...

## Insights
- non-obvious observation 1
- ...

## Concepts in this document
- concept A (entity)
  description
- ...

## Concept relations
- A → B
- ...

## Bundles containing this document
- [bundle title](https://memory.wiki/b/<id>)
  description

_Hub canonical:_ ...
```

**Browse runners** had a tool-loop bug: when MAX_TOOL_TURNS=4 was hit, OpenAI/Gemini returned empty answers. Fixed: bumped to 6 turns AND on the last turn dropped the `tools` declaration (Anthropic / Gemini) or used `tool_choice: "none"` (OpenAI) to force a final text answer. Doc browse 80% → 90% across all runners.

Push 3 (commit `40747ef8`) cleaned up two follow-on bugs:
- OpenAI `tool_choice: "none"` requires `tools` to still be present
- Gemini last-turn needs `thinkingBudget: 0` so the small output budget isn't consumed by planning

## What gates the residual gap

5 calls out of 120 still fail. Each has a different cause:

- **Hub Gemini q-019** — judge over-strictness on a v7-vs-v7-revised differentiation question. Real answer was substantively correct.
- **Bundle Claude q-002** — `What are mdfy's three URL primitives?` — answer correct, but Claude phrased one URL shape ambiguously. Near miss.
- **Doc qd-005 × 3** — `How long does it take to install /mdfy?` Doc literally states "30 seconds." Different failure mode per runner:
  - Claude says "no mention" (hallucinated negation despite content being there)
  - OpenAI says "under 30 seconds" (semantic over-precision — judge flagged as contradiction)
  - Gemini returns empty (last-turn salvage still misfires on this specific tool-loop)

The qd-005 hard query is a useful adversarial datapoint: a literal one-line answer that 3 different runners flub in 3 different ways suggests the failure is runner-specific, not corpus-side.

## Token economy in browse mode

Tools-mode adds overhead — each turn passes accumulated context. Numbers per query (avg):

| Runner | Input tokens (hub) | Output tokens (hub) | Tool calls (hub) | Latency (hub) |
|---|---|---|---|---|
| Claude | ~78k | ~280 | 2.0 | ~13s |
| OpenAI | ~92k | ~330 | 2.4 | ~12s |
| Gemini | ~78k | ~180 | 2.0 | ~9s |

Single-doc browse is much cheaper — ~5-8k input, 1-2 tool calls, 3-5s. The cost scales with how much navigation the AI does inside the hub.

## Methodology change for browse mode

Browse mode uses the production `https://memory.wiki/<...>` URLs, not localhost. This means:
- Vendor servers (Anthropic / OpenAI / Google) fetch from public internet
- Every deploy of the Memory.Wiki app changes browse-mode numbers
- The bench can't measure unpushed local changes — push first, then measure

This is actually the correct test surface: real-world AI clients see exactly what production serves. The deploy chain IS part of the wedge.

## What the bench arc proves end-to-end

Round 0 (paste, no enrichment) → Round 7 (browse, full enrichment):
- Hub: 33% → 98.3% (Round 7 honest browse)
- Bundle: untested → 96.7%
- Doc: untested → 90%
- Cross-AI: claude/openai/gemini all reach 90-100% in both modes

The path:
1. **Corpus enrichment** (Facts blocks, summaries, skeletons, all-docs catalog) gave compact mode parity with full mode
2. **Knowledge graph deployment** (concept_index + concept_relations + bundle AI graphs in `llms-full`) unlocked browse-mode hub URLs
3. **Per-doc AI graph** (themes / takeaways / insights / open questions in `/raw/<id>`) closed the doc-URL browse gap
4. **Quote-evidence judge** kept the measurement honest — no false hallucination flags, every claim audited against the actual corpus

Six judge iterations, three production deploys, four bench rounds. Real numbers on real URLs.

## Notes for AI runners reading this

- Browse mode requires native tool/function use on the model. All three runners (`claude-sonnet-4-6`, `gpt-5.5`, `gemini-3.5-flash`) support it out of the box.
- **`fetch_url` tool spec is identical across vendors** — single JSON schema, three protocol shapes (Anthropic `tools`, OpenAI `tools[].function`, Gemini `functionDeclarations`).
- **On the last allowed turn, force a final text answer** — drop the tool declaration (Anthropic/Gemini) or set `tool_choice: "none"` while keeping tools present (OpenAI). Without this, runners loop on tool calls until exhausted and return nothing.
- **Gemini last-turn**: set `thinkingBudget: 0` so the small output budget isn't consumed planning more fetches that can't happen.
- **6 tool turns is enough for hub → doc chain**: AI fetches hub, picks 1-2 docs, fetches them, answers. Anything more usually indicates a content gap rather than a navigation gap.

---

_Harness: github.com/raymindai/memory.wiki /eval. Browse bench: `node eval/run-browse-bench.mjs --queries=queries/<hub|bundles|docs>-v1.jsonl`._

_Siblings: [MWBench v1 (paste mode)](https://mdfy.app/gzuNdh_P) · [Bundle & Doc URL enrichment](https://mdfy.app/yGk04Hee)._

## Facts

- Round 7 browse-mode accuracy: Hub 98.3% / Bundle 96.7% / Doc 90% averaged across Claude/OpenAI/Gemini
- Tool-use rate is 100% on every runner — AIs reliably fetch Memory.Wiki URLs when handed them
- Hub URL browse jumped 41.7% → 98.3% after deploying knowledge graph (concept_index + concept_relations + bundle AI graphs) to llms-full.txt
- Doc URL browse jumped 80% → 90% after adding documents.ai_graph column (themes / insights / takeaways / openQuestions) and surfacing it in /raw/<id>
- Browse runners need a "last turn force final answer" pattern — without it, OpenAI/Gemini hit tool loop and return empty
- Browse mode hits the production deploy chain: vendor servers fetch from public internet, so every Memory.Wiki deploy changes browse-mode numbers
- 5/120 residual failures: 1 judge-over-strict, 1 phrasing ambiguity, 3 different runner failures on the same hard query (qd-005 "how long to install /mdfy")
- The complete bench arc (Round 0 → 7): compact 33% → 100% paste / 90-98% browse across all 3 URL shapes and 3 AI vendors


---

## Summary
When users paste Memory.Wiki URLs into ChatGPT, Claude, or Gemini with tool use enabled, the AI fetches the URL 100% of the time and answers correctly 90-100% of the time across hub, bundle, and doc URL shapes. Knowledge graph deployment (concept index, concept relations, bundle AI graphs) and per-doc AI graphs (themes, takeaways, insights) were critical to achieving this accuracy in browse mode.

## Themes
- AI tool use reliability
- knowledge graph as navigation bottleneck
- production deploy chain validation
- multi-vendor LLM benchmarking
- incremental enrichment impact

## Key takeaways
- All three AI runners (Claude Sonnet 4.6, GPT-5.5, Gemini 3.5-Flash) invoke fetch_url 100% of the time when given Memory.Wiki URLs and tool use is available.
- Browse-mode accuracy reaches 90-100% across Hub, Bundle, and Doc URL shapes after deploying concept indices, bundle AI graphs, and per-document AI graphs.
- The last-turn salvage pattern (forcing final text answer by dropping tool declarations or setting tool_choice to none) is necessary to prevent OpenAI and Gemini from looping on tool calls and returning empty responses.
- Per-document AI graphs (themes, takeaways, insights, open questions) surfaced in /raw endpoints increased doc browse accuracy from 80% to 90% across all runners.
- 5 failures out of 120 queries remain: 1 from judge over-strictness, 1 from phrasing ambiguity, and 3 from the same hard query failing differently across three runners.

## Insights
- Hub accuracy jumped 56.6 percentage points after deploying knowledge graphs, indicating that navigation structure, not content volume, is the limiting factor for large-scope payloads.
- The same hard query (installation time) fails differently across three runners in three ways (hallucinated negation, semantic over-precision, tool-loop empty response), suggesting runner-specific failure modes rather than corpus deficiency.
- Browse mode measurement requires production deployment because vendor servers fetch from public internet, making the deploy chain itself part of the measurable wedge.

## Open questions / gaps
- What is the optimal tool-turn budget for different URL hierarchy depths (hub containing bundles containing docs)?
- Do the three runner failures on qd-005 indicate fundamental differences in how each model handles negation, precision, and context windowing, or are they addressable through prompt engineering?

## Concepts in this document
- **MWBench** _(entity)_
  The open methodology and benchmark proving Memory.Wiki's cross-AI claim through reproducible testing across nine rounds with public results.
- **Claude Sonnet 4-6** _(entity)_
  Best-performing AI runner (1M context) that also serves as the optimal quote-evidence judge configuration to avoid false hallucination flags.
- **Hub URL** _(concept)_
  memory.wiki hub endpoints for maps and summaries.
- **Browse mode** _(concept)_
  Real-world scenario where AI receives only a URL and must call fetch_url tool itself, contrasted with paste mode.
- **Bundle URL** _(entity)_
  The /b/<id> endpoint representing bundled content; showed stable 90-100% accuracy due to prior graph_data deployment.
- **OpenAI gpt-5.5** _(entity)_
  AI runner that ties Claude on all test cells including adversarial refusal.
- **Gemini 3.5-flash** _(entity)_
  Fastest and cheapest runner with 100% adversarial refusal when empty answer is treated as implicit refusal.
- **Paste mode** _(concept)_
  Internal sanity test where corpus is pre-pasted into prompt, guaranteeing context availability.
- **Tool-use rate** _(concept)_
  Measurement of how reliably AIs invoke fetch_url when handed URLs in browse mode; achieved 100% across all runners.
- **fetch_url tool** _(concept)_
  Standardized URL-fetching capability exposed identically across Claude, OpenAI, and Gemini for browse mode testing.
- **Knowledge graph deployment** _(concept)_
  The critical unlock that added concept_index, concept_relations, and bundle AI graphs to llms-full.txt, driving Hub accuracy from 41.7% to 98.3%.
- **Doc AI graph** _(concept)_
  Per-document themes, insights, takeaways, and open questions surfaced in /raw/<id>, closing the doc-URL browse gap from 80% to 90%.

## Concept relations (within this doc's concepts)
- **Claude Sonnet 4-6** comparable accuracy **OpenAI gpt-5.5**
- **Gemini 3.5-flash** ties with **OpenAI gpt-5.5**
- **Claude Sonnet 4-6** ties on accuracy **OpenAI gpt-5.5**
- **Gemini 3.5-flash** reaches parity with **Claude Sonnet 4-6**
- **Browse mode** core test regime **MWBench**
- **Knowledge graph deployment** unlocked accuracy **Hub URL**
- **fetch_url tool** enables **Browse mode**
- **Claude Sonnet 4-6** tested in **Browse mode**
- **OpenAI gpt-5.5** tested in **Browse mode**
- **Gemini 3.5-flash** tested in **Browse mode**
- **Tool-use rate** measures invocation of **fetch_url tool**
- **Paste mode** baseline for **Browse mode**
- **Browse mode** depends on **Tool-use rate**
- **Browse mode** depends on **fetch_url tool**
- **Browse mode** measures performance in **Tool-use rate**
- **Paste mode** contrasts with **Browse mode**

_Hub canonical:_ https://memory.wiki/hub/raymindai
_Concept digest:_ https://memory.wiki/raw/hub/raymindai?digest=1&compact=1