Reading: Karpathy on LLM evals
A good error message answers three questions: what happened, why it happened, and what to try next. Most ship the first, hint at the second, and forget the third. The fix is usually a single sentence longer.
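As a rough illustration of the three questions above, here is a minimal sketch. The `ErrorReport` class, its field names, and the sample message are all hypothetical and invented for this example. It only shows one way to make the "what to try next" sentence hard to forget, by making it a required field.

```python
from dataclasses import dataclass


@dataclass
class ErrorReport:
    """Hypothetical container: one required field per question."""

    what: str       # what happened
    why: str        # why it happened
    next_step: str  # what to try next

    def render(self) -> str:
        # One sentence per question, in the order the reader needs them.
        return f"{self.what} {self.why} {self.next_step}"


# Sample values are made up for illustration.
report = ErrorReport(
    what="Could not save notes.txt.",
    why="The disk is full.",
    next_step="Free up some space and try again.",
)
print(report.render())
```

Because all three fields are required, leaving out the third sentence raises a `TypeError` when the report is constructed instead of shipping a half-finished message.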
The hardest part of a 1-person startup isn't the work — it's the lack of a forcing function. Without a meeting on Tuesday, nothing has to ship on Monday. The schedule has to come from somewhere, and "because I said so" isn't enough.
Three rules I keep returning to
- Ship one feature, deeply, before two features shallowly.
- The interface IS the product. The engine just has to keep up.
- Anything important should fit on one screen.
# Tiny script that prints any URL's title.
import re
import requests

def title(url: str) -> str:
    # Fetch the page and look for the first <title> element.
    html = requests.get(url, timeout=5).text
    m = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    # Fall back to the URL itself when the page has no title.
    return m.group(1).strip() if m else url

print(title("https://memory.wiki"))
"The best note-taking system is the one you already have open." — every productivity post ever, and also true
The thesis here[1] is that delivery model matters more than retrieval quality.
What changed
The interesting thing about long-context models isn't that they can read more — it's that they finally make the retrieval problem optional. When a model can hold the whole repo in context, the question shifts from "what should I fetch?" to "what should I show?". That's a UX question, not an infrastructure one.
[1] First articulated in the W6 internal note "Graph RAG is delivery, not retrieval."