Reading: Karpathy on LLM evals

A good error message answers three questions: what happened, why it happened, and what to try next. Most ship the first, hint at the second, and forget the third. The fix is usually a single sentence longer.
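
Here is a minimal sketch of that three-question shape. The `ErrorReport` class, its field names, and the sample message are all hypothetical, made up for illustration rather than taken from any library.

```python
from dataclasses import dataclass


@dataclass
class ErrorReport:
    """Hypothetical container for the three parts of an error message."""

    what: str  # what happened
    why: str   # why it happened
    fix: str   # what to try next

    def render(self) -> str:
        # One sentence per question, in the order the reader needs them.
        return f"{self.what} {self.why} {self.fix}"


report = ErrorReport(
    what="Could not save notes.txt.",
    why="The disk is full.",
    fix="Free up some space, then save again.",
)
print(report.render())
```

Making `fix` a required field is the point of the design: the message cannot be constructed without the sentence that usually gets forgotten.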

The hardest part of a one-person startup isn't the work; it's the lack of a forcing function. Without a meeting on Tuesday, nothing has to ship on Monday. The schedule has to come from somewhere, and "because I said so" isn't enough.

Three rules I keep returning to

  • Ship one feature, deeply, before two features shallowly.
  • The interface IS the product. The engine just has to keep up.
  • Anything important should fit on one screen.

```python
# Tiny script that prints any URL's title.
import re

import requests


def title(url: str) -> str:
    html = requests.get(url, timeout=5).text
    m = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return m.group(1).strip() if m else url


print(title("https://memory.wiki"))
```

"The best note-taking system is the one you already have open." — every productivity post ever, and also true

The thesis here[1] is that the delivery model matters more than retrieval quality.

What changed

The interesting thing about long-context models isn't that they can read more — it's that they finally make the retrieval problem optional. When a model can hold the whole repo in context, the question shifts from "what should I fetch?" to "what should I show?". That's a UX question, not an infrastructure one.


  1. First articulated in the W6 internal note "Graph RAG is delivery, not retrieval." ↩︎