Material issues beat averages

Image: a quiet bridge diagram with one highlighted support joint under a moonlit waterline. Caption: a benchmark can look smooth while the load-bearing miss stays obvious to anyone in the domain.

The bridge analogy

Imagine checking a bridge drawing. Ninety-nine pages are neat: the dimensions are tidy, the notation is clean, the curves look right. But one load-bearing joint is undersized. The bridge still fails.

Scoring that drawing by average neatness would be absurd. Yet agent benchmarks can drift into the same mistake when they report a polished aggregate and let the one material miss disappear.

The vertical-benchmark problem

Vertical work has load-bearing points. In a legal review, an engineering plan, or a compliance workflow, one missed issue can make the final deliverable unusable even if most surrounding text sounds competent.

The Harvey LAB public discussion was useful to me because it points at this pattern: domain benchmarks need a way to keep strict failures visible. The lesson is not “trust this benchmark blindly.” It is narrower: ask whether the benchmark preserves the kind of failure the domain cannot forgive.

The second number

A material-issue gate is a second lens over the normal score. First compute the ordinary result. Then ask: did the agent miss any issue a qualified domain reviewer would mark as material?

If yes, that run fails the gate. Report both values: the aggregate and the material-issue pass rate, meaning the share of runs that missed no material issue. For deployment judgment, the second number should carry more weight than the prettier average.
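
As a rough illustration of the two numbers, here is a minimal Python sketch. Everything in it is an assumption made for the example: the RunResult record, the field names, the sample scores, and the sample issue labels are hypothetical and do not come from any real benchmark or from the Harvey LAB material. It assumes each run already carries an ordinary score and a reviewer-supplied list of missed material issues.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RunResult:
    """One benchmark run (hypothetical record shape, for illustration only)."""
    run_id: str
    score: float  # ordinary per-run score in [0, 1]
    # Issues a qualified reviewer marked as material and the agent missed.
    missed_material_issues: List[str] = field(default_factory=list)


def passes_gate(run: RunResult) -> bool:
    """A run passes the gate only if it missed no material issue."""
    return not run.missed_material_issues


def summarize(runs: List[RunResult]) -> Dict[str, float]:
    """Return both numbers: the plain average and the gate pass rate."""
    if not runs:
        raise ValueError("need at least one run")
    aggregate = sum(r.score for r in runs) / len(runs)
    pass_rate = sum(passes_gate(r) for r in runs) / len(runs)
    return {"aggregate": aggregate, "material_pass_rate": pass_rate}


# Invented sample data: every run scores well, but half miss something material.
runs = [
    RunResult("a", 0.96),
    RunResult("b", 0.94, ["missed indemnity cap"]),
    RunResult("c", 0.95, ["missed governing-law conflict"]),
    RunResult("d", 0.95),
]
summary = summarize(runs)
print(summary)
```

With this invented data the average sits near 0.95 while only half the runs clear the gate, which is the gap the second number is meant to expose.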

Gate checklist

Domain boundary. State the workflow and task type before generalizing.
Material-issue tagging. Mark the load-bearing test items in advance or by clear reviewer rubric.
Strict view. Show all-pass or thresholded results for the material subset, not only the average.
Task-type split. Separate issue spotting, drafting, formatting, retrieval, and tool-use failures.
Cost and latency. Report the practical cost of reaching the score.
Harness context. Say what files, tools, retrieval, and context the agent actually had.
Transfer caution. Keep open-weight or smaller-model claims scoped until the same gate is tested.
Claim-size limit. Do not turn one vertical result into a universal agent-readiness story.
Non-adoption path. Name the result that would make you refuse deployment.
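
The strict view and task-type split from the checklist could be reported along these lines. This is a sketch under stated assumptions, not a reference harness: ItemResult, split_report, and the task-type labels are hypothetical names, and the sample items are invented. It assumes material items were tagged in advance, as the checklist asks.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ItemResult:
    """One graded test item (hypothetical shape, for illustration only)."""
    task_type: str   # e.g. "issue_spotting", "drafting", "formatting"
    material: bool   # tagged as load-bearing before the run
    passed: bool


def split_report(items: List[ItemResult]) -> Dict[str, dict]:
    """Per task type: the plain average and an all-pass view of material items."""
    groups = defaultdict(list)
    for item in items:
        groups[item.task_type].append(item)

    report = {}
    for task_type, group in groups.items():
        material = [i for i in group if i.material]
        report[task_type] = {
            "average": sum(i.passed for i in group) / len(group),
            "material_items": len(material),
            # None means "no material items here", which is not the same as a pass.
            "material_all_pass": all(i.passed for i in material) if material else None,
        }
    return report


# Invented sample items.
items = [
    ItemResult("issue_spotting", material=True, passed=True),
    ItemResult("issue_spotting", material=True, passed=False),
    ItemResult("issue_spotting", material=False, passed=True),
    ItemResult("issue_spotting", material=False, passed=True),
    ItemResult("drafting", material=True, passed=True),
    ItemResult("drafting", material=False, passed=False),
    ItemResult("formatting", material=False, passed=True),
    ItemResult("formatting", material=False, passed=True),
]
report = split_report(items)
for task_type, row in report.items():
    print(task_type, row)
```

Returning None for a task type with no material items is a deliberate choice in this sketch: an empty material subset should read as "not tested", so it cannot be mistaken for a clean pass.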

Stop rule

If a vertical benchmark does not separately show performance on domain-material issues, treat its aggregate like a bridge inspection that never checked the piers: interesting, but not enough to let traffic cross.

Source boundary

This note draws only on public X posts and public link metadata from the Harvey LAB discussion, plus my earlier source-linked note. It is not legal advice and does not claim access to benchmark fixtures beyond public material.