RFC: hub recall API
Status: shipped.
The endpoint
POST mdfy.app/api/hub/{slug}/recall
Body (JSON):
{
  "question": "How does cross-AI memory work?",
  "k": 10,
  "level": "doc",
  "rerank": true
}
Returns the top-k matching chunks (or docs), ranked.
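As a rough illustration of calling the endpoint, the sketch below builds the POST request without sending it. The `https://` scheme, the `my-hub` slug, and the helper name `build_recall_request` are assumptions made for the example; the RFC itself only specifies the host, path, and body fields.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the endpoint is served over HTTPS; the RFC gives only host and path.
BASE_URL = "https://mdfy.app"


def build_recall_request(slug, question, k=10, level="doc", rerank=True):
    """Build (but do not send) a POST request for the recall endpoint."""
    body = {"question": question, "k": k, "level": level, "rerank": rerank}
    # Percent-encode the slug so it cannot alter the path.
    path = f"/api/hub/{urllib.parse.quote(slug, safe='')}/recall"
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "my-hub" is a hypothetical slug used only for illustration.
req = build_recall_request("my-hub", "How does cross-AI memory work?")
```

Passing `req` to `urllib.request.urlopen` would send it; the response shape beyond the fields listed in this RFC is not assumed here.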
How the recall actually works
- Embedding lookup. Embed the question with text-embedding-3-small (1536 dim).
- Hybrid retrieval. Run two queries in parallel against the user's hub:
  - Vector. pgvector cosine match against documents.embedding (HNSW index, ef_search = 40).
  - Lexical. Postgres to_tsvector full-text search against documents.fts.
- Union + de-dup. Concatenate the top 30 from each, de-dup by doc id. This yields ~30-50 unique candidates (at most 60).
- Reranker (optional). If rerank: true, send the candidates + the question to Anthropic Haiku, which scores each match. Re-sort by Haiku's score, take top-k.
- Return. Each result includes: doc id, doc title, doc URL, the matched chunk text, the rank score, and the source (vector / lexical / both).
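The union, de-dup, and rerank steps above can be sketched in a few lines. This is a minimal in-memory model, not the service's code: the `Candidate` type, the function names, and the injected `scorer` callable (standing in for the Haiku call) are hypothetical. Two behaviours the RFC does not specify are chosen arbitrarily here: a doc found by both queries keeps its vector score, and without a reranker the concatenation order is kept.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    score: float
    source: str  # "vector", "lexical", or "both"


def union_dedup(vector_hits, lexical_hits, per_source=30):
    """Concatenate the top `per_source` hits from each list, de-duped by doc id.

    Each hit is a (doc_id, score) pair, already ranked best-first.
    """
    merged = {}
    for doc_id, score in vector_hits[:per_source]:
        merged[doc_id] = Candidate(doc_id, score, "vector")
    for doc_id, score in lexical_hits[:per_source]:
        if doc_id in merged:
            # Assumption: keep the vector score, just record both sources.
            merged[doc_id].source = "both"
        else:
            merged[doc_id] = Candidate(doc_id, score, "lexical")
    return list(merged.values())  # dicts preserve insertion order


def recall(vector_hits, lexical_hits, k, scorer=None):
    """Return the top-k candidates, re-sorted by `scorer` when one is given."""
    candidates = union_dedup(vector_hits, lexical_hits)
    if scorer is not None:
        for candidate in candidates:
            candidate.score = scorer(candidate)
        candidates.sort(key=lambda c: c.score, reverse=True)
    return candidates[:k]
```

Injecting the scorer keeps the merge logic testable without a network call, and makes the `rerank: false` path a matter of passing `scorer=None`.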
What's tunable
- k: number of results to return. 1-20.
- level: "doc" returns whole docs; "chunk" returns specific passages (chunks are pre-computed at ~500 tokens each).
- rerank: boolean. Default true. Costs ~300ms p95. Set to false for speed-first paths.
- min_score: discard results below a cosine threshold. Useful for "don't return anything if nothing matches."
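A hedged sketch of how these parameters might be validated on the way in. Only the `rerank` default (true), the `k` range, and the two `level` values come from this RFC; the defaults for `k` and `level` are assumptions borrowed from the example body, and `RecallParams` and `parse_params` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_K = 10         # assumption: taken from the example body, not a stated default
DEFAULT_LEVEL = "doc"  # assumption: taken from the example body, not a stated default


@dataclass(frozen=True)
class RecallParams:
    question: str
    k: int = DEFAULT_K
    level: str = DEFAULT_LEVEL
    rerank: bool = True  # stated default
    min_score: Optional[float] = None


def parse_params(body):
    """Validate a decoded JSON body and return RecallParams, or raise ValueError."""
    question = body.get("question")
    if not isinstance(question, str) or not question.strip():
        raise ValueError("question must be a non-empty string")

    k = body.get("k", DEFAULT_K)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(k, bool) or not isinstance(k, int) or not 1 <= k <= 20:
        raise ValueError("k must be an integer from 1 to 20")

    level = body.get("level", DEFAULT_LEVEL)
    if level not in ("doc", "chunk"):
        raise ValueError('level must be "doc" or "chunk"')

    rerank = body.get("rerank", True)
    if not isinstance(rerank, bool):
        raise ValueError("rerank must be a boolean")

    min_score = body.get("min_score")
    if min_score is not None:
        if isinstance(min_score, bool) or not isinstance(min_score, (int, float)):
            raise ValueError("min_score must be a number")
        min_score = float(min_score)

    return RecallParams(question, k, level, rerank, min_score)
```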
Auth
The endpoint is publicly callable for public hubs. For restricted/private hubs, the caller has to be the owner OR have an MCP-signed token. Anonymous calls to a private hub return 401.
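The access rule can be read as a small decision function. The sketch below assumes a hypothetical `visibility` string and two pre-computed booleans. The RFC only states the 401 for anonymous calls, so returning 401 for every rejected caller is a simplification, not a documented behaviour.

```python
def recall_status(visibility, caller_is_owner=False, has_valid_mcp_token=False):
    """Return the HTTP status a recall call would get under the rule above.

    `visibility` is assumed to be "public", "restricted", or "private".
    """
    if visibility == "public":
        return 200  # public hubs are callable by anyone
    if caller_is_owner or has_valid_mcp_token:
        return 200  # owner OR MCP-signed token is sufficient
    # Stated for anonymous callers; applied to all rejected callers here
    # as a simplifying assumption.
    return 401
```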
What it doesn't do
- Multi-hop reasoning. No "fetch this, then fetch what it links to, then aggregate." That's a higher-level construct that lives in the caller's loop.
- Live recomputation of embeddings. We embed at write time; recall reads from the existing vectors. Staleness is bounded by the longest delay between a doc edit and the embedding-refresh job (currently 30s).
- Graph traversal. Recall is flat over chunks. The graph relationships are at the concept level, accessible separately via the concept index.
What's next
- Per-hub recall caching. Common queries against a public hub should be cacheable for ~60s.
- Streaming results. Today the response waits for the reranker to finish. We could stream the union results as they arrive and replace them as the reranker scores them. Tradeoff: more complex client code.
- Configurable embedding model. Currently hardcoded to OpenAI text-embedding-3-small. Worth exposing if we ever support a non-OpenAI default.
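For the per-hub recall caching item, one possible shape is a small TTL cache keyed by hub slug plus the canonicalized request parameters. Everything here is a sketch under assumptions: the `RecallCache` class, its in-process dict storage, and the key format are hypothetical, and only the ~60s lifetime comes from this RFC.

```python
import hashlib
import json
import time


class RecallCache:
    """In-memory TTL cache for recall responses, keyed by (slug, params)."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    @staticmethod
    def key(slug, params):
        # Sort keys so logically equal bodies map to the same cache entry.
        canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(f"{slug}\n{canonical}".encode("utf-8")).hexdigest()

    def get(self, slug, params):
        cache_key = self.key(slug, params)
        entry = self._entries.get(cache_key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[cache_key]  # drop the expired entry
            return None
        return value

    def put(self, slug, params, value):
        self._entries[self.key(slug, params)] = (self._clock() + self._ttl, value)
```

A cache like this would only be safe for public hubs, since the key does not include the caller's identity.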