Reading: Karpathy on LLM evals

A good error message answers three questions: what happened, why it happened, and what to try next. Most ship the first, hint at the second, and forget the third. The fix is usually a single sentence longer.

The hardest part of a 1-person startup isn't the work — it's the lack of a forcing function. Without a meeting on Tuesday, nothing has to ship on Monday. The schedule has to come from somewhere, and "because I said so" isn't enough.

Three rules I keep returning to

Ship one feature, deeply, before two features shallowly.
The interface IS the product. The engine just has to keep up.
Anything important should fit on one screen.

python
# Tiny script that prints any URL's title.
import requests, re
def title(url: str) -> str:
    html = requests.get(url, timeout=5).text
    m = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return m.group(1).strip() if m else url
print(title("https://memory.wiki"))

"The best note-taking system is the one you already have open." — every productivity post ever, and also true

The thesis here^[1] is that delivery model matters more than retrieval quality.

What changed

The interesting thing about long-context models isn't that they can read more — it's that they finally make the retrieval problem optional. When a model can hold the whole repo in context, the question shifts from "what should I fetch?" to "what should I show?". That's a UX question, not an infrastructure one.

First articulated in the W6 internal note "Graph RAG is delivery, not retrieval." ↩︎