Teaching Your AI Team to Remember: Why We Built Strug Recall

Building autonomous agents isn't the hard part anymore. The hard part is giving them the memory infrastructure they need to make good decisions over time. Here's what we learned building Strug Recall.

I thought I understood the context window problem. Keep conversations short. Use RAG for facts. Embed everything. Standard playbook, right?

Then I watched one of our agents—sc-backend—make the same architectural mistake three times in two weeks. Not because the LLM forgot. Because our infrastructure didn't give it a way to remember what mattered.

The Problem Nobody Talks About

When you're building with autonomous agents—actually autonomous, not glorified chatbots—you hit a wall that's not about prompts or model selection. It's about organizational memory.

Your frontend agent needs to know the design system guidelines. Your backend agent needs to remember why we chose Supabase over Firebase. Your content writer needs consistent brand voice. But here's the thing: they don't need all of this all the time. And they definitely don't need it cluttering their system prompt or eating 30% of their context window on every task.

Traditional solutions fall into two traps:

  • Trap 1: Everything in the system prompt. You end up with 4,000-token preambles full of guidelines nobody asked for on this particular task. The agent spends more tokens on context than on execution.
  • Trap 2: Nothing persistent. Every task starts from zero. Your agents rediscover the same patterns, repeat the same mistakes, and ask the same clarifying questions every single time.

We needed something in between. Something queryable, scopeable, and confidence-weighted. Something that could evolve as the team learned.

What Strug Recall Actually Does

Strug Recall is our agent memory system. It's not a vector database. It's not a prompt cache. It's a structured, scoped knowledge layer that sits between your agents and their execution context.

Every memory entry has three dimensions:

  • Scope: Is this global knowledge (tech stack, brand voice), role-specific knowledge (sc-frontend conventions), or domain-specific knowledge (how we handle auth)?
  • Confidence: How certain are we about this? Brand guidelines get 1.0. Experimental patterns might be 0.7. This helps agents prioritize when memories conflict.
  • Recency: When was this last accessed? When was it created? A memory that's never used might be outdated or irrelevant.
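To make the three dimensions concrete, here is a minimal Python sketch of what a single entry could look like. The class name, field names, and the `touch` helper are assumptions for illustration, not Strug Recall's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class MemoryEntry:
    """One memory record. All field names here are illustrative assumptions."""

    key: str      # short identifier, e.g. "tech-stack"
    content: str  # the knowledge itself
    scope: str    # "global", a role like "sc-frontend", or a domain like "auth"
    confidence: float = 1.0  # 0.0 (a guess) to 1.0 (settled)
    created_at: datetime = field(default_factory=_utc_now)
    last_accessed_at: Optional[datetime] = None  # None until the first read

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")

    def touch(self) -> None:
        """Record an access so recency can be tracked."""
        self.last_accessed_at = _utc_now()


entry = MemoryEntry(
    key="tech-stack",
    content="Supabase for database, auth, and realtime",
    scope="global",
    confidence=1.0,
)
```

Keeping creation time and last-access time as separate fields is what lets you tell a new memory apart from one that has simply never been read.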

When an agent starts a task, it queries the relevant scope. It gets back the 10-20 most relevant, highest-confidence memories. Not everything we've ever decided. Just what matters for this specific job.
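A scoped query like that can be sketched in a few lines of Python. This is a simplified stand-in, not the real API: the `recall` function, the `Memory` class, the confidence values, and the scope assigned to each entry are hypothetical, and the sketch ranks by confidence only, leaving relevance scoring out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    key: str
    content: str
    scope: str
    confidence: float


def recall(memories, scopes, limit=20):
    """Return up to `limit` memories from the given scopes, highest confidence first."""
    in_scope = [m for m in memories if m.scope in scopes]
    return sorted(in_scope, key=lambda m: m.confidence, reverse=True)[:limit]


# Hypothetical store contents; scopes and confidence values are made up for the example.
store = [
    Memory("python-format", "Use Black formatter with 88-char line length", "sc-backend", 0.7),
    Memory("frontend-conventions", "placeholder for sc-frontend conventions", "sc-frontend", 1.0),
    Memory("tech-stack", "Supabase for database, auth, and realtime", "global", 1.0),
    Memory("dashboard-name", "Call the dashboard Strug Central", "global", 0.9),
]

# A backend agent asks for global knowledge plus its own role scope.
results = recall(store, scopes={"global", "sc-backend"}, limit=20)
keys = [m.key for m in results]
```

The frontend entry never reaches the backend agent, even though its confidence is 1.0. Scope filters first, confidence ranks second.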

What Changed After We Shipped It

The immediate win was consistency. Our content writer stopped asking what to call the dashboard (it's Strug Central, not "God View" anymore—whole other story). Our backend agents stopped debating database choices mid-task. They query global scope, see 'tech-stack: Supabase for database, auth, and realtime,' and move on.

But the unexpected win was learning velocity. When an agent encounters a new pattern—say, how we structure Portable Text for Sanity blog posts—it can write that pattern to memory with a confidence score. Next time any agent writes a blog post, that knowledge is available. Not buried in a closed PR. Not lost in a conversation log. Queryable and reusable.

We're also using it for failure recovery. When something breaks—say, a deployment pipeline times out—the agent writes a memory: 'deployment-timeout: increase timeout from 30s to 60s for Docker builds.' That pattern doesn't disappear. The next agent hitting that issue starts with the solution, not the debugging cycle.
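As a rough sketch, the write side could look like the following. `MemoryStore`, `remember`, and `lookup` are hypothetical names for an in-memory stand-in, and the `deployment` scope and 0.8 confidence are assumed values, not what our agents actually write.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Memory:
    key: str
    content: str
    scope: str
    confidence: float
    created_at: datetime


class MemoryStore:
    """Hypothetical in-memory stand-in for a memory service."""

    def __init__(self):
        self._entries = {}

    def remember(self, scope, key, content, confidence):
        """Write a memory. In this sketch the latest write for a (scope, key) pair wins."""
        entry = Memory(key, content, scope, confidence, datetime.now(timezone.utc))
        self._entries[(scope, key)] = entry
        return entry

    def lookup(self, scope, key):
        """Return the memory for (scope, key), or None if nothing was recorded."""
        return self._entries.get((scope, key))


store = MemoryStore()

# The agent that debugged the timeout records the fix.
store.remember(
    scope="deployment",
    key="deployment-timeout",
    content="increase timeout from 30s to 60s for Docker builds",
    confidence=0.8,
)

# A later agent hitting the same failure starts from the recorded fix.
hit = store.lookup("deployment", "deployment-timeout")
```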

The Hard Parts

Memory decay is real. Not every lesson stays relevant. We haven't solved pruning yet—right now it's manual. I go through Strug Recall once a week and deprecate memories that no longer apply. That doesn't scale, but it's teaching us what an automated decay model should look like.
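One possible first step toward automation is a helper that only flags stale entries for that weekly review and leaves the deprecation decision to a human. The sketch below is an assumption about how that could work, not something we run today: the 90-day idle threshold, the dictionary keys, and the example dates are all invented.

```python
from datetime import datetime, timedelta, timezone


def stale_candidates(memories, now, max_idle=timedelta(days=90)):
    """Return the keys of memories not read (or, if never read, not created) within `max_idle`."""
    stale = []
    for m in memories:
        # Fall back to the creation time for entries that were never accessed.
        last_used = m["last_accessed_at"] or m["created_at"]
        if now - last_used > max_idle:
            stale.append(m["key"])
    return stale


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Invented example data.
memories = [
    {"key": "never-read", "created_at": utc(2024, 1, 1), "last_accessed_at": None},
    {"key": "recently-read", "created_at": utc(2024, 1, 1), "last_accessed_at": utc(2024, 12, 20)},
    {"key": "new-entry", "created_at": utc(2024, 12, 1), "last_accessed_at": None},
]

to_review = stale_candidates(memories, now=utc(2025, 1, 1))
```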

Conflict resolution is still fuzzy. When two memories contradict—old pattern vs. new pattern—confidence scores help, but they're not perfect. Sometimes the older, lower-confidence memory is actually the right call for a specific context. We're experimenting with context tags to disambiguate, but it's early.
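Here is one way the context-tag idea could be sketched: prefer the memory whose tags overlap most with the current task, and fall back to confidence only to break ties. The `resolve` function, the tag names, and both example memories are hypothetical, and this is not a description of what Strug Recall does today.

```python
def resolve(candidates, task_tags):
    """Pick the candidate sharing the most tags with the task; break ties by confidence."""
    task = set(task_tags)
    return max(
        candidates,
        key=lambda m: (len(task & set(m["tags"])), m["confidence"]),
    )


# Invented example: an older, lower-confidence memory tagged for one specific context.
older = {"key": "older-pattern", "confidence": 0.6, "tags": ["legacy-service"]}
newer = {"key": "newer-pattern", "confidence": 0.9, "tags": []}

in_legacy_context = resolve([older, newer], ["legacy-service"])["key"]
in_other_context = resolve([older, newer], ["new-service"])["key"]
```

With this ordering, the older memory wins only when the task carries its tag. Everywhere else the higher-confidence memory is used.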

And honestly, writing good memories is a skill. Early entries were either too vague ('use best practices') or too specific ('set line 47 of config.py to true'). We're learning that the sweet spot is principle + example: 'Use Black formatter with 88-char line length. Run black . before commits.'

Why This Matters for You

If you're building with agents—really building, not experimenting—you will hit this problem. Your agents will start producing inconsistent output. They'll repeat mistakes. They'll ask the same questions. And you'll realize the bottleneck isn't the model. It's the infrastructure for memory.

You don't need Strug Recall specifically. You need something that gives your agents durable, queryable, scopeable memory. Something that evolves with your system. Something that treats organizational knowledge as infrastructure, not documentation.

For us, it's the difference between agents that execute tasks and agents that learn from execution. That's the gap between automation and autonomy.

What's Next

We're working on automated entity extraction—teaching agents to identify when they've learned something worth remembering and write it to Strug Recall without human intervention. We're also experimenting with memory graphs: not just isolated facts, but relationships between concepts.

And we're building a UI for it. Right now Strug Recall is API-only, which is fine for agents but terrible for me. I want to see what the team knows, browse by scope, track confidence drift over time. That's coming soon.

If you're wrestling with the same problem—or if you've solved it differently—I'd love to hear about it. This is one of those infrastructure problems that feels obvious in retrospect but is surprisingly subtle to get right.

— Ryan