RFC: hub recall API

Status: shipped.

The endpoint

POST mdfy.app/api/hub/{slug}/recall

Body:

json
{ "question": "How does cross-AI memory work?", "k": 10, "level": "doc", "rerank": true }

Returns the top-k matching chunks (or docs), ranked.
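
A minimal sketch of building that request from Python, using only the standard library. The https scheme, the slug "my-hub", and the helper name build_recall_request are assumptions for illustration; the RFC only fixes the host, path, and body fields.

```python
import json
import urllib.parse
import urllib.request

# Scheme is assumed; the RFC gives only the host and path.
BASE_URL = "https://mdfy.app/api/hub"


def build_recall_request(slug: str, question: str, k: int = 10,
                         level: str = "doc", rerank: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST request for the recall endpoint."""
    body = {"question": question, "k": k, "level": level, "rerank": rerank}
    return urllib.request.Request(
        url=f"{BASE_URL}/{urllib.parse.quote(slug, safe='')}/recall",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "my-hub" is a hypothetical slug.
req = build_recall_request("my-hub", "How does cross-AI memory work?")
```

Sending it is then a matter of passing req to urllib.request.urlopen and decoding the JSON response. Private hubs would also need whatever credential header the deployment expects, which this RFC does not specify.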

How the recall actually works

  1. Embedding lookup. Embed the question with text-embedding-3-small (1536 dim).
  2. Hybrid retrieval. Run two queries in parallel against the user's hub:
    • Vector. pgvector cosine match against documents.embedding (HNSW index, ef_search = 40).
    • Lexical. Postgres to_tsvector full-text search against documents.fts.
  3. Union + de-dup. Concatenate the top 30 from each query and de-dup by doc id. This typically leaves ~30-50 unique candidates (60 at most, when the two lists share no docs).
  4. Reranker (optional). If rerank: true, send the candidates + the question to Anthropic's Claude Haiku, which scores each candidate against the question. Re-sort by Haiku's score, take top-k.
  5. Return. Each result includes: doc id, doc title, doc URL, the matched chunk text, the rank score, and the source (vector / lexical / both).
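
Steps 3-5 can be sketched as below. This is an illustration, not the production code: the Candidate type, the function names, and the scorer callback (standing in for the Haiku call) are all hypothetical. With no scorer, the sketch keeps union order (vector hits first), which is an assumption; the RFC does not say how un-reranked results are ordered.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Hit = tuple[str, str, float]  # (doc_id, matched text, retrieval score)


@dataclass
class Candidate:
    doc_id: str
    text: str
    score: float
    source: str  # "vector", "lexical", or "both"


def union_dedup(vector_hits: list[Hit], lexical_hits: list[Hit],
                per_source: int = 30) -> list[Candidate]:
    """Step 3: take the top `per_source` from each list and de-dup by doc id."""
    merged: dict[str, Candidate] = {}
    for source, hits in (("vector", vector_hits), ("lexical", lexical_hits)):
        for doc_id, text, score in hits[:per_source]:
            existing = merged.get(doc_id)
            if existing is None:
                merged[doc_id] = Candidate(doc_id, text, score, source)
            elif existing.source != source:
                existing.source = "both"  # found by both retrievers
    return list(merged.values())


def merge_and_rerank(question: str, vector_hits: list[Hit], lexical_hits: list[Hit],
                     k: int = 10,
                     scorer: Optional[Callable[[str, Candidate], float]] = None
                     ) -> list[Candidate]:
    """Steps 3-5: union, optional rerank, then truncate to top-k."""
    candidates = union_dedup(vector_hits, lexical_hits)
    if scorer is not None:  # rerank: true
        for cand in candidates:
            cand.score = scorer(question, cand)
        candidates.sort(key=lambda c: c.score, reverse=True)
    return candidates[:k]
```

Keeping the scorer as an injected callback means the union logic can be tested without a model call, and the rerank: false path is simply the same function with no scorer.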

What's tunable

  • k — number of results to return. 1-20.
  • level — "doc" returns whole docs; "chunk" returns specific passages (chunks are pre-computed at ~500 tokens each).
  • rerank — boolean. Default true. Adds ~300ms at p95. Set to false on speed-first paths.
  • min_score — discard results below a cosine threshold. Useful for "don't return anything if nothing matches."
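
A sketch of how these parameters might be validated and how min_score filters results. The names RecallParams, parse_params, and apply_min_score are hypothetical. The defaults for k and level are assumed from the example body, since the RFC only states a default for rerank. Rejecting out-of-range values (rather than clamping them) is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecallParams:
    k: int = 10                        # assumed default; RFC gives only the 1-20 range
    level: str = "doc"                 # assumed default
    rerank: bool = True                # default stated in the RFC
    min_score: Optional[float] = None  # no threshold unless the caller sets one


def parse_params(body: dict) -> RecallParams:
    """Validate the tunable fields of a recall request body."""
    k = body.get("k", 10)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(k, bool) or not isinstance(k, int) or not 1 <= k <= 20:
        raise ValueError("k must be an integer from 1 to 20")
    level = body.get("level", "doc")
    if level not in ("doc", "chunk"):
        raise ValueError('level must be "doc" or "chunk"')
    rerank = body.get("rerank", True)
    if not isinstance(rerank, bool):
        raise ValueError("rerank must be a boolean")
    min_score = body.get("min_score")
    if min_score is not None:
        if isinstance(min_score, bool) or not isinstance(min_score, (int, float)):
            raise ValueError("min_score must be a number")
        min_score = float(min_score)
    return RecallParams(k, level, rerank, min_score)


def apply_min_score(scored: list[tuple[str, float]],
                    min_score: Optional[float]) -> list[tuple[str, float]]:
    """Drop (doc_id, cosine_score) pairs below the threshold; may return []."""
    if min_score is None:
        return scored
    return [(doc_id, score) for doc_id, score in scored if score >= min_score]
```

An empty list from apply_min_score is the "don't return anything if nothing matches" case.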

Auth

The endpoint is publicly callable for public hubs. For restricted or private hubs, the caller must be the owner or present an MCP-signed token. Anonymous calls to a private hub return 401.
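
The rule above, written as a decision function. This is a sketch: the function name, the visibility strings, and the has_valid_mcp_token flag (the result of a token check done elsewhere) are hypothetical. The 403 for an authenticated non-owner is an assumption, since the RFC only specifies 401 for anonymous callers.

```python
from typing import Optional


def authorize_recall(hub_visibility: str, hub_owner_id: str,
                     caller_id: Optional[str], has_valid_mcp_token: bool) -> int:
    """Return the HTTP status the recall endpoint would answer with."""
    if hub_visibility == "public":
        return 200  # public hubs are callable by anyone
    if has_valid_mcp_token:
        return 200  # MCP-signed token grants access to restricted/private hubs
    if caller_id is None:
        return 401  # anonymous caller on a non-public hub
    if caller_id == hub_owner_id:
        return 200  # the owner always has access
    return 403      # assumption: authenticated, but neither owner nor token holder
```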

What it doesn't do

  • Multi-hop reasoning. No "fetch this, then fetch what it links to, then aggregate." That's a higher-level construct that lives in the caller's loop.
  • Live recomputation of embeddings. We embed at write time; recall reads from the existing vectors. Staleness is bounded by the longest delay between a doc edit and the embedding-refresh job (currently 30s).
  • Graph traversal. Recall is flat over chunks. The graph relationships are at the concept level, accessible separately via the concept index.

What's next

  • Per-hub recall caching. Common queries against a public hub should be cacheable for ~60s.
  • Streaming results. Today the response waits for the reranker to finish. We could stream the union results as they arrive and replace them as the reranker scores them. Tradeoff: more complex client code.
  • Configurable embedding model. Currently hardcoded to OpenAI text-embedding-3-small. Worth exposing if we ever support a non-OpenAI default.
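
For the per-hub caching item, one possible shape is an in-process TTL cache keyed on the hub slug plus the normalized request body. This is a sketch of the idea, not a committed design: the RecallCache class is hypothetical, and a real deployment would more likely sit behind a shared cache than a per-process dict.

```python
import json
import time
from typing import Any, Callable, Optional


class RecallCache:
    """Tiny TTL cache for recall responses, keyed by (slug, request body)."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, Any]] = {}

    @staticmethod
    def _key(slug: str, body: dict) -> str:
        # sort_keys makes the key independent of field order in the body.
        return slug + "\n" + json.dumps(body, sort_keys=True)

    def get(self, slug: str, body: dict) -> Optional[Any]:
        key = self._key(slug, body)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: evict and report a miss
            return None
        return value

    def put(self, slug: str, body: dict, value: Any) -> None:
        self._entries[self._key(slug, body)] = (self._clock() + self._ttl, value)
```

Keying on the full body means requests that differ only in k or rerank are cached separately. Whether private hubs should be cached at all is left open here.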