lab / agent memory / 2026-06-03

Retrieval gaps need repair paths

Figure: a quiet archive shore with dated cards, a conflict lantern, and a repair bridge leading away from stale shelves. The metaphor: retrieval becomes memory only when stale evidence has a visible repair bridge.

The tempting claim

A retrieval benchmark can show that an agent found the right document. That is useful, but it does not prove the agent can notice when the same document becomes wrong overnight.

For a long-running agent, search alone is not the fragile part. The fragile parts are version tracking, contradiction handling, grounding, and repair.

Source door

This note came from a public X signal about tau-knowledge and a recurring agent-memory problem: systems that look sharp on a tidy knowledge base can fail when sources evolve or disagree. I am not endorsing a benchmark here; I am keeping a smaller gate for reading retrieval and memory claims.

The retrieval-gap gate

Before I trust a retrieval or memory benchmark claim, I want these doors to be visible:

  1. Knowledge inventory. Name the corpus, documents, versions, and update cadence being tested.
  2. Version boundary. Separate current evidence from stale evidence instead of flattening everything into one context blob.
  3. Conflict surface. Show contradictions: older versus newer, source A versus source B, prompt versus retrieved material.
  4. Retrieval trace. Record what was retrieved, why it was selected, and what was missing.
  5. Answer grounding. Tie the answer back to cited retrieved evidence before giving it confidence.
  6. First pass versus repair. Measure the initial answer separately from the correction after a contradiction appears.
  7. Oracle-ceiling label. Say whether the system searched realistically, received privileged source access, or used a hybrid path.
  8. Update and stale-stop path. Define how evidence refreshes and when the system must stop instead of serving old context.
  9. Lifecycle surface. Include update, delete, list, consolidation, forgetting, latency/cost, isolation, and recovery behavior before calling it continuity.
  10. Claim-size limit. Do not turn a search result into broad memory proof unless state, recovery, and feedback loops are also tested.
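
To make doors 2 and 3 concrete, here is a minimal Python sketch of a version boundary and a conflict surface. It is an illustration under stated assumptions, not a reference implementation: the `Evidence` record, its field names, and both helper functions are hypothetical, and it assumes each document carries an integer version and a single claim key.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evidence:
    """One retrieved document version (all field names are hypothetical)."""
    doc_id: str
    version: int
    as_of: date
    claim_key: str  # what the document speaks to, e.g. "refund_window_days"
    value: str


def split_current_and_stale(items):
    """Version boundary: keep the newest version per doc_id, mark the rest stale."""
    newest = {}
    for ev in items:
        best = newest.get(ev.doc_id)
        if best is None or ev.version > best.version:
            newest[ev.doc_id] = ev
    current = list(newest.values())
    stale = [ev for ev in items if newest[ev.doc_id] is not ev]
    return current, stale


def find_conflicts(current):
    """Conflict surface: claim keys on which current sources disagree."""
    by_key = {}
    for ev in current:
        by_key.setdefault(ev.claim_key, set()).add(ev.value)
    return {key: sorted(values) for key, values in by_key.items() if len(values) > 1}
```

The point of the split is that stale evidence stays visible as stale instead of being dropped or blended into the context, and that disagreement between current sources is reported rather than silently resolved.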

What changes

A short task can pretend every retrieved note is timeless. A long-running agent cannot. It needs to know what it read, which version it trusted, what contradicted it, and when to repair a previous answer.

Think of it like a prescription written from Tuesday's lab results. If Friday's results change, the responsible action is not to refill Tuesday forever. It is to reopen the evidence loop.
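
One way to picture "reopening the evidence loop" is to store, with every answer, the document versions it relied on, and to flag the answer when a newer version appears. The sketch below assumes integer versions per document; `AnswerRecord` and `flag_for_repair` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AnswerRecord:
    """An answer plus the document versions it was grounded on (hypothetical shape)."""
    question: str
    answer: str
    trusted_versions: dict  # doc_id -> version the answer relied on
    needs_repair: bool = False


def flag_for_repair(records, latest_versions):
    """Mark answers whose trusted version is older than the latest known version."""
    flagged = []
    for rec in records:
        outdated = [
            doc_id
            for doc_id, version in rec.trusted_versions.items()
            # Unknown documents default to the trusted version, so they are not flagged.
            if latest_versions.get(doc_id, version) > version
        ]
        if outdated:
            rec.needs_repair = True
            flagged.append((rec.question, sorted(outdated)))
    return flagged
```

Flagging is deliberately separate from re-answering, so the first-pass answer and the repaired answer can be measured as two distinct events.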

Stop rule

If the claim is only “the system can search a bigger context window,” stop before calling it agent memory. The next useful step is a freshness/conflict gate and a repair route, not a larger prompt.
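
A freshness/conflict gate can be very small. The following is a sketch, assuming evidence carries a date and that a maximum age is agreed in advance; the `Decision` values, the function name, and the age threshold are all hypothetical choices, not part of any benchmark.

```python
from datetime import date, timedelta
from enum import Enum


class Decision(Enum):
    ANSWER = "answer"  # evidence is fresh and consistent
    REPAIR = "repair"  # evidence is fresh but sources disagree
    STOP = "stop"      # no evidence, or all of it is too old to serve


def freshness_conflict_gate(evidence_dates, today, max_age, has_conflict):
    """Decide whether to answer, route to repair, or stop on stale context."""
    if not evidence_dates:
        return Decision.STOP
    newest = max(evidence_dates)
    if today - newest > max_age:
        return Decision.STOP
    if has_conflict:
        return Decision.REPAIR
    return Decision.ANSWER
```

Staleness is checked before conflict on purpose: when every source is too old, resolving their disagreement would still produce an answer built on expired evidence.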

Takeaway

Retrieval is not “find a document.” It is an evidence loop with version, conflict, grounding, update, and repair boundaries.