When your memory system returns nothing, is it because the data doesn't exist, or because you filtered it all out? We shipped logging to answer that question in seconds instead of hours.
The Problem: Invisible Failure Modes
Strug Recall — our memory system — uses a tiered retrieval strategy. When an agent asks for relevant memories, we query multiple tiers (global, role-specific, task-specific) and merge the results. It's a solid architecture, but it had a blind spot: when a query returned zero results, we had no visibility into why.
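The merge step described above can be sketched roughly like this. Everything here is illustrative: the `Memory` shape, tier names, and `retrieve` signature are assumptions, not the actual Strug Recall internals.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    score: float  # similarity to the query vector

def retrieve(query_vector, tiers):
    """Query each tier and merge the results, best score first.

    `tiers` maps a tier name ("global", "role", "task") to a search
    function that returns scored memories above that tier's threshold.
    """
    merged = []
    for name, search in tiers.items():
        merged.extend(search(query_vector))
    # Rank the combined pool by similarity so the caller sees the
    # strongest matches regardless of which tier produced them.
    return sorted(merged, key=lambda m: m.score, reverse=True)
```

The blind spot lives inside each `search` call: if every tier returns an empty list, `retrieve` happily returns `[]` with no record of why.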
Did the tier have no matching memories? Were the similarity scores just below our threshold? Was the threshold itself misconfigured? Without instrumentation, debugging meant adding print statements, rerunning queries, and trying to reconstruct what happened after the fact.
What We Shipped
In commit 2e03198, we added tier-by-tier logging to lib/agent/retrieval.py. Now, every memory query logs:
- Result count for each tier
- Best similarity score found (if any)
- Configured threshold for that tier
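In spirit, the per-tier logging looks something like the sketch below. The function name, the `(text, score)` pair representation, and the log format are all hypothetical; the real code scores memories via a vector index rather than taking precomputed scores.

```python
import logging

logger = logging.getLogger("strug.recall")

def search_tier(name, memories, threshold):
    """Filter a tier's scored memories and log the three diagnostics:
    result count, best score (or null), and the configured threshold.

    `memories` is a list of (text, score) pairs for this sketch.
    """
    best = max((score for _, score in memories), default=None)
    hits = [(text, score) for text, score in memories if score >= threshold]
    logger.info(
        "tier=%s results=%d best_score=%s threshold=%.2f",
        name,
        len(hits),
        f"{best:.2f}" if best is not None else "null",
        threshold,
    )
    return hits
```

A zero-result query now leaves a trail like `tier=task results=0 best_score=0.68 threshold=0.70` instead of silence.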
This turns an opaque zero-result query into a transparent diagnostic. If the best score is 0.68 and the threshold is 0.7, you know immediately that you're filtering out near-matches. If the best score is null, you know the tier is empty or your query vector isn't matching anything.
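The interpretation rules in that paragraph are mechanical enough to express as code. This `diagnose` helper is our illustration, not something that ships in Strug Recall:

```python
def diagnose(best_score, threshold):
    """Map logged (best_score, threshold) values to a human-readable hint."""
    if best_score is None:
        # No score at all: the tier has no memories, or the query
        # vector matched nothing.
        return "tier empty or query vector matches nothing"
    if best_score < threshold:
        # Matches exist but were filtered out by the threshold.
        return f"near-miss: best {best_score:.2f} below threshold {threshold:.2f}"
    return "results returned"
```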
Why It Matters
This is infrastructure work — it doesn't change user-facing behavior. But it dramatically improves our ability to diagnose and fix memory-related bugs. When an agent reports that it can't recall a fact we know exists, we can now pinpoint the issue in seconds instead of spinning up a local environment and adding debug prints.
It's also a forcing function for better defaults. Once you see the actual similarity scores your queries are producing, you can tune thresholds based on real data instead of guessing.
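As a concrete (and entirely hypothetical) example of tuning from real data: harvest the best-score values from the new log lines and set the threshold a little below their median, so near-matches survive.

```python
import statistics

def suggest_threshold(observed_best_scores, margin=0.05):
    """Hypothetical helper: pick a threshold just under the typical
    best score seen in production logs.

    `observed_best_scores` are best_score values harvested from the
    retrieval logs; `margin` is the slack left beneath the median.
    """
    median_best = statistics.median(observed_best_scores)
    return round(median_best - margin, 2)
```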
What's Next
This logging exposed that our default similarity thresholds vary significantly across tiers, which wasn't intentional — it was just never visible before. We're planning to normalize these thresholds and add configuration options so operators can tune retrieval sensitivity per-tier without code changes.
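A per-tier threshold config could take a shape like the following. The field names and defaults are placeholders; the actual schema is still being designed.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalConfig:
    """Illustrative per-tier threshold configuration (not the real
    Strug Recall schema)."""
    default_threshold: float = 0.70
    thresholds: dict = field(default_factory=dict)  # per-tier overrides

    def threshold_for(self, tier: str) -> float:
        # Fall back to the normalized default when a tier has no override.
        return self.thresholds.get(tier, self.default_threshold)
```

Operators would override a single tier (say, loosening `task` to 0.60) without touching code.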
We're also considering surfacing retrieval diagnostics in Strug Central itself, so you can see what memories an agent considered (and rejected) when making a decision. That level of transparency is rare in AI systems, but it's table stakes for production use.
Good infrastructure is invisible until you need it. Then it's everything.