Paperly.
A research-summarization pipeline combining OCR, GraphRAG, and agentic retrieval. Researchers query an academic corpus by question rather than keyword.
The hard problem worth writing about.
A vector store is a fine ranker until the answer is not in a single paper. Most non-trivial research questions sit across two or three citations, separated by a few hops of reference graph and at least one paragraph of unindexed PDF. Dense retrieval alone returns the shape of the answer; it does not assemble it.
The problem we wanted to solve was not faster retrieval. It was retrieval that could explain why a paper was returned: which neighbour cited which claim, which OCR-recovered figure was the corroborating one, and what an agent would do next if asked.
The first prototype was a single Jupyter notebook with two cells, both wrong. The second was a smaller notebook that admitted what it didn't know. Everything after that has been about giving the model a graph it can walk and a harness that disagrees with it.
How we built it.
The pipeline runs in three layers.
The ingestion layer is OCR-first. We use a layout-aware extractor that preserves figure proximity to paragraph claims — important because a figure cited two sections away is often the one carrying the load. The output is structured: paragraphs, captions, citation contexts, and a per-document claim graph.
The graph layer is Neo4j. Nodes are papers, sections, and named claims; edges are cites, citation-context, and figure-of relationships. We deliberately did not collapse claims into the paper node — the whole point is to traverse from claim to claim across papers. The graph is typed, which lets the agent reason about edge semantics, not just adjacency.
The agentic loop is small. A planner proposes one to three retrieval moves; each move executes either a vector lookup or a typed graph traversal; a verifier scores the assembled context against the question; the loop terminates on confidence or on a hop budget. A fine-tuned Llama-3 summarisation head writes the final answer over the assembled context, with inline citations linking back to the source sections.
The eval harness was its own product. Build it before the model — the rest is implementation.
The harness was the deliverable I was proudest of. We built a 400-paper benchmark with question types graded for hop count — single-paper, two-hop, three-hop — and tracked F1@5 plus median end-to-end latency on every commit. Every regression got a name and a row in the test set. By the end, the bench disagreed with the model often enough that we trusted both.
What we got wrong, twice.
Two things humbled us. First, the OCR layer over-trusted page geometry. PDFs from older journals threw off the layout extractor, and the graph filled with phantom edges that looked plausible but pointed at the wrong claim. We added an explicit "OCR confidence" score on each extracted span, and the planner learned to discount low-confidence neighbours when assembling context.
Second, the verifier was too lenient at first. It scored whether the answer was supported but not whether the support was the strongest available. The fix was to score against a retrieved-but-rejected set, not against a null baseline. Median latency rose by 40 ms — the right trade.
Outcome.
The harness teaches you what the model is doing. Build it first; the rest is implementation.
