Eval harnesses are the product.

March 2026 · 7 min read

A short argument for treating retrieval evals as first-class artifacts, with the same engineering discipline as the system they measure.

The model gets better; the eval set should get better faster.

The most useful thing I built on Paperly was not the model. It was the harness that disagreed with it. Every time the model surprised me, the harness gained a row. The rows became a curriculum the model was trained to ignore and the harness was sworn to remember.

The most common discipline failure I see in retrieval projects is treating the harness as scaffolding. It is the product. The system you ship is the system the harness signed off on.

What a serious harness looks like.

Versioned. Every row has a hash. Every commit knows which rows passed.
Stratified by what makes the task hard. For retrieval, that is hop count, ambiguity, and base rate. Don't average across these strata: a single average hides which one moved.
Adversarial to itself. Add new rows even when the model passes them, not only when it fails them. The harness is a curriculum.
Tracked alongside latency. A model that moves from 0.78 to 0.81 at 4× the latency is, in most production contexts, a regression. Report the two numbers together.
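
To make the list above concrete, here is a minimal Python sketch of three of those properties: content-hashed rows, per-stratum scoring, and a quality check that also looks at latency. It is an illustration under stated assumptions, not the Paperly harness. The names `Row`, `stratified_accuracy`, and `is_regression`, the row fields, and the 1.5× latency budget are all hypothetical, and the base-rate stratum is left out to keep it short.

```python
import hashlib
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One eval row. The field names are illustrative assumptions."""
    query: str
    expected_doc_id: str
    hop_count: int   # stratum: retrieval hops needed to answer
    ambiguous: bool  # stratum: query has more than one plausible reading

    @property
    def row_hash(self) -> str:
        # Hash a canonical JSON form so identical content always gets the same ID.
        payload = json.dumps(
            {"q": self.query, "doc": self.expected_doc_id,
             "hops": self.hop_count, "amb": self.ambiguous},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def stratified_accuracy(rows, passed_hashes):
    """Return accuracy per (hop_count, ambiguous) stratum, not one global average."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        key = (row.hop_count, row.ambiguous)
        totals[key] += 1
        if row.row_hash in passed_hashes:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


def is_regression(old_score, new_score, old_latency_ms, new_latency_ms,
                  max_latency_ratio=1.5):
    """Flag a change if quality drops or latency exceeds the (assumed) budget."""
    if new_score < old_score:
        return True
    return new_latency_ms > old_latency_ms * max_latency_ratio


# Made-up rows, purely for illustration.
rows = [
    Row("who wrote the cited survey?", "doc-17", hop_count=2, ambiguous=False),
    Row("latest version", "doc-03", hop_count=1, ambiguous=True),
    Row("author of paper X", "doc-09", hop_count=1, ambiguous=True),
]
passed = {rows[0].row_hash, rows[1].row_hash}  # what one commit passed

print(stratified_accuracy(rows, passed))
print(is_regression(0.78, 0.81, 100.0, 400.0))  # True under the assumed 1.5x budget
```

Storing the set of passed hashes next to each commit is one way to let every commit know which rows passed. The latency budget is a knob rather than a rule: the right ratio depends on the production context.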

The cost is the point.

Yes, building a real harness slows you down. That is the value. Speed without a harness is a velocity statistic on a project that does not know where it is going. The harness is what makes the speed legible.

If you do nothing else in your next retrieval project, build the harness first. Build the model second. Watch them argue.
