§ 00
ESSAY / MAY 2026

Learning GraphRAG by building Paperly.

Notes on the moment a vector database stops being enough — and what graph traversal actually buys you when you commit to it.

The moment the vector store breaks.

There is a specific kind of question that dense retrieval cannot answer. It is the kind whose answer lives across two or three papers, joined by a citation, a shared author, or a re-used dataset. The vector store will return all three documents with respectable cosine scores, and a competent summariser will produce a respectable paragraph that averages their contents. The paragraph is wrong in a quiet way — it merges claims that were never meant to be merged.
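To make the failure concrete, here is a minimal sketch of flat dense retrieval. The three-dimensional vectors and the paper ids are invented for illustration, and a real system would use a learned embedding model and a vector index rather than a Python dict. Note what the result is: a ranked list of ids and scores, with nothing recording how the returned papers relate to one another.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Rank documents by cosine similarity and keep the best k."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings: three near-duplicates and one unrelated paper.
docs = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.8, 0.2, 0.1],
    "paper_c": [0.7, 0.3, 0.0],
    "paper_d": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]

hits = top_k(query, docs, k=3)
for doc_id, score in hits:
    print(f"{doc_id}: {score:.3f}")
```

All three hits score well, and the summariser downstream receives them as interchangeable context. Whether paper_b cites paper_a in order to contradict it is not something this data structure can express.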

I first noticed this in the second week of building Paperly. The corpus was a couple of hundred ML papers, the question was "does retrieval-augmented generation improve factual grounding more on long-tail entities than on head entities?", and the system confidently said yes on the basis of three papers that, on close reading, did not share that conclusion. Two of them said the opposite, for different reasons.

What a graph actually buys you.

The fix was to stop pretending the corpus was flat. Papers cite each other; sections inside papers reference figures; figures appear in tables; tables get re-used. That is a graph with typed edges, and a flat retriever, by construction, ignores it.

GraphRAG, then, is the practice of:

  1. Treating the corpus as a directed multigraph with typed edges (cites, figure-of, re-uses-dataset, author-of).
  2. Letting the retrieval policy traverse the graph rather than just rank documents.
  3. Letting the type of an edge inform the agent's next move.

You stop assembling context by similarity and start assembling it by structure.
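The three points above can be sketched in a few dozen lines. This is not Paperly's code: the class and function names (`Edge`, `CorpusGraph`, `traverse`) and the node ids are hypothetical, and the traversal policy here is a plain breadth-first expansion that only follows the edge types it is told to. The edge type names are the ones listed above.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str  # e.g. "cites", "figure-of", "re-uses-dataset", "author-of"


class CorpusGraph:
    """Directed multigraph: parallel edges of different kinds are allowed."""

    def __init__(self):
        self._out = defaultdict(list)

    def add_edge(self, src, dst, kind):
        self._out[src].append(Edge(src, dst, kind))

    def neighbours(self, node, kinds=None):
        """Outgoing edges of `node`, optionally filtered by edge type."""
        return [e for e in self._out.get(node, [])
                if kinds is None or e.kind in kinds]


def traverse(graph, seeds, allowed_kinds, max_hops=1):
    """Breadth-first expansion from the seeds along allowed edge types.

    Returns every path found, as a list of Edge lists, so the caller can
    see which edges were followed and of which kind.
    """
    paths = []
    queue = deque((seed, []) for seed in seeds)
    visited = set(seeds)
    while queue:
        node, path = queue.popleft()
        if len(path) >= max_hops:
            continue
        for edge in graph.neighbours(node, allowed_kinds):
            new_path = path + [edge]
            paths.append(new_path)
            if edge.dst not in visited:
                visited.add(edge.dst)
                queue.append((edge.dst, new_path))
    return paths


graph = CorpusGraph()
graph.add_edge("paper_a", "paper_b", "cites")
graph.add_edge("paper_a", "paper_b", "re-uses-dataset")  # parallel edge
graph.add_edge("fig_1", "paper_a", "figure-of")
graph.add_edge("paper_b", "paper_c", "cites")

paths = traverse(graph, ["paper_a"], {"cites"}, max_hops=2)
for path in paths:
    print(" -> ".join(f"{e.src} [{e.kind}] {e.dst}" for e in path))
```

Returning paths rather than a bag of documents is the design choice that matters: the context handed to the generator carries the reason each paper was included.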

The 2-hop benchmark.

The most useful exercise during the build was constructing a 2-hop benchmark: questions whose answers required following exactly one citation, so that each answer spans two papers. Single-paper questions are easy and uninformative; three-hop questions are noisy. Two-hop questions are where the system breaks honestly: you can see whether the agent took the right edge, the right kind of edge, and what it did when the obvious neighbour was the wrong one.
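A benchmark item of this kind can be small. The sketch below assumes each item stores the one edge the agent is supposed to follow, and labels what the agent did instead. The names `TwoHopItem` and `diagnose`, the failure labels, and the example question are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

EdgeT = Tuple[str, str, str]  # (src, dst, kind)


@dataclass(frozen=True)
class TwoHopItem:
    question: str
    seed: str          # paper the agent starts from
    gold_edge: EdgeT   # the single edge the answer depends on


def diagnose(item: TwoHopItem, taken: Optional[EdgeT]) -> str:
    """Label the hop the agent took relative to the gold edge."""
    if taken is None:
        return "no_hop"
    if taken == item.gold_edge:
        return "correct"
    src, _dst, kind = taken
    gold_src, _gold_dst, gold_kind = item.gold_edge
    if src != gold_src:
        return "wrong_seed"
    if kind != gold_kind:
        return "wrong_edge_kind"
    return "wrong_neighbour"  # right kind of edge, wrong target


item = TwoHopItem(
    question="Which dataset does the cited baseline evaluate on?",
    seed="paper_a",
    gold_edge=("paper_a", "paper_b", "cites"),
)

print(diagnose(item, ("paper_a", "paper_b", "cites")))            # correct
print(diagnose(item, ("paper_a", "paper_c", "cites")))            # wrong_neighbour
print(diagnose(item, ("paper_a", "paper_b", "re-uses-dataset")))  # wrong_edge_kind
```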

By the end of the build, F1@5 on the 2-hop slice was 0.71 versus 0.54 for the dense-only baseline, an absolute gain of 0.17. The headline number is misleading on its own. The real value is that the failure modes of the graph-augmented system are legible: when it fails, you can tell which hop went wrong.
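For readers who want to reproduce a number of this shape, here is one common way to compute F1@k. The essay does not define the metric, so the definition below is an assumption: precision and recall are taken over the top k retrieved ids against a gold set, combined as a harmonic mean, then averaged over questions. The ids in the example are made up.

```python
def f1_at_k(retrieved, gold, k=5):
    """F1 between the top-k retrieved ids and the gold set of ids."""
    top = retrieved[:k]
    gold = set(gold)
    if not top or not gold:
        return 0.0
    hits = len(set(top) & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1_at_k(runs, k=5):
    """Average F1@k over (retrieved, gold) pairs, one pair per question."""
    scores = [f1_at_k(retrieved, gold, k) for retrieved, gold in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical results for two questions.
runs = [
    (["a", "b", "c", "d", "e"], {"a", "c"}),  # both gold papers retrieved
    (["x", "y", "z", "q", "w"], {"m"}),       # gold paper missed
]
print(round(mean_f1_at_k(runs), 3))
```

An aggregate like this says nothing about where a miss happened, which is why it is worth logging the hop taken for each question alongside the score.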

The bit nobody writes about.

The graph is the easy part. The work that does not appear in any paper is the OCR-to-graph pipeline. PDFs are not text; they are page layouts that occasionally contain text in roughly the order the author intended. Building the layout-aware extractor — the thing that decides which paragraph a figure caption belongs to — was where the project lived for two months.
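As a flavour of the problem, here is a deliberately naive geometric heuristic for attaching a caption to a paragraph. It is not the extractor described above. It assumes an upstream layout step has already produced bounding boxes in page coordinates with y growing downward, and it picks the vertically nearest paragraph that shares horizontal extent with the caption. The `Box` type and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def horizontal_overlap(a, b):
    """Width shared by two boxes; 0.0 if they sit in different columns."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def vertical_gap(a, b):
    """Vertical distance between two boxes; 0.0 if they overlap."""
    if a.y1 <= b.y0:
        return b.y0 - a.y1
    if b.y1 <= a.y0:
        return a.y0 - b.y1
    return 0.0


def assign_caption(caption, paragraphs):
    """Nearest same-page paragraph in the caption's column, or None."""
    candidates = [p for p in paragraphs
                  if p.page == caption.page and horizontal_overlap(caption, p) > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda p: vertical_gap(caption, p))


# Two-column page: the right-column paragraph sits at the same height as the
# caption, but it belongs to a different column.
caption = Box("Figure 2 caption", page=3, x0=50, y0=400, x1=290, y1=420)
paragraphs = [
    Box("left_above", page=3, x0=50, y0=100, x1=290, y1=300),
    Box("left_below", page=3, x0=50, y0=430, x1=290, y1=600),
    Box("right_same_height", page=3, x0=310, y0=395, x1=550, y1=425),
    Box("other_page", page=4, x0=50, y0=400, x1=290, y1=420),
]

match = assign_caption(caption, paragraphs)
print(match.label if match else None)
```

Even this toy case shows why extraction order is not enough: a reader that linearises the page row by row would place the right-column paragraph next to the caption.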

If I were starting again, I would build the OCR pipeline before touching the retrieval policy. The graph quality bounds everything downstream.
