RFC: hub recall API

Status: shipped.

The endpoint

POST mdfy.app/api/hub/{slug}/recall

Body:

json
{
  "question": "How does cross-AI memory work?",
  "k": 10,
  "level": "doc",
  "rerank": true
}

Returns the top-k matching chunks (or docs), ranked.

How the recall actually works

Embedding lookup. Embed the question with text-embedding-3-small (1536 dim).
Hybrid retrieval. Run two queries in parallel against the user's hub:
- Vector. pgvector cosine match against documents.embedding (HNSW index, ef_search = 40).
- Lexical. Postgres to_tsvector full-text search against documents.fts.
Union + de-dup. Concatenate the top 30 from each, de-dup by doc id. ~30-50 unique candidates.
Reranker (optional). If rerank: true, send the candidates + the question to Anthropic Haiku, which scores each match. Re-sort by Haiku's score, take top-k.
Return. Each result includes: doc id, doc title, doc URL, the matched chunk text, the rank score, and the source (vector / lexical / both).

What's tunable

k — number of results to return. 1-20.
level — "doc" returns whole docs; "chunk" returns specific passages (chunks are pre-computed at ~500 tokens each).
rerank — boolean. Default true. Costs ~300ms p95. False for speed-first paths.
min_score — discard results below a cosine threshold. Useful for "don't return anything if nothing matches."

Auth

The endpoint is publicly callable for public hubs. For restricted/private hubs, the caller has to be the owner OR have an MCP-signed token. Anonymous calls to a private hub return 401.

What it doesn't do

Multi-hop reasoning. No "fetch this, then fetch what it links to, then aggregate." That's a higher-level construct that lives in the caller's loop.
Live recomputation of embeddings. We embed at write time; recall reads from the existing vectors. Staleness is bounded by the longest delay between a doc edit and the embedding-refresh job (currently 30s).
Graph traversal. Recall is flat over chunks. The graph relationships are at the concept level, accessible separately via the concept index.

What's next

Per-hub recall caching. Common queries against a public hub should be cacheable for ~60s.
Streaming results. Today the response waits for the reranker to finish. We could stream the union results as they arrive and replace them as the reranker scores them. Tradeoff: more complex client code.
Configurable embedding model. Currently hardcoded to OpenAI ada-3-small. Worth exposing if we ever support a non-OpenAI default.