Engineering · Mar 25, 2025

Teaching Sabine to Know What She's Reading


Before today, Sabine's memory system had a blind spot: it couldn't tell what kind of document it was reading.

Everything that entered her memory pipeline—emails, PDFs, web pages, Slack threads—got embedded into the same vector space with the same treatment. An urgent email from a client looked structurally identical to a product specification or a meeting transcript. The content was searchable, but the context of what kind of document it was got lost in the embedding.

This matters because document type changes how you should retrieve and use information. When I ask Sabine to "find that email about the deployment," I want emails weighted higher than documentation pages that happen to mention deployments. When she's building a brief, meeting notes carry different authority than random web clippings.

The Problem We Actually Had

I noticed this gap when Sabine started surfacing irrelevant results for time-sensitive queries. I'd ask for "today's schedule" and she'd pull archived calendar exports alongside the current one because they both mentioned dates and meetings. The ranking was purely semantic similarity—no awareness that one document was a live calendar and the other was historical data.

The fix wasn't just adding metadata tags. It was building a classifier that runs before embedding, analyzing document structure and content to assign a type: email, calendar_event, meeting_transcript, documentation, web_article, or pdf_report. These types then influence both storage strategy and retrieval scoring.
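The six types and the shape of a classified document can be sketched roughly like this. This is a minimal illustration, not Sabine's actual code; the class names are placeholders.

```python
from dataclasses import dataclass
from enum import Enum

class DocType(Enum):
    """The six document types the classifier assigns."""
    EMAIL = "email"
    CALENDAR_EVENT = "calendar_event"
    MEETING_TRANSCRIPT = "meeting_transcript"
    DOCUMENTATION = "documentation"
    WEB_ARTICLE = "web_article"
    PDF_REPORT = "pdf_report"

@dataclass
class ClassifiedDocument:
    """A document after classification, ready for embedding.
    The type travels with the text into storage so retrieval can use it."""
    text: str
    doc_type: DocType
```

Keeping the type as structured metadata alongside the text is what lets the storage and retrieval layers treat it differently later.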

What Changed

The document classifier sits at the front of the memory ingest pipeline now. When a new document arrives—from email polling, web scraping, or manual upload—it gets classified before embedding. The classifier uses a lightweight model trained on structural features: headers, sender fields, timestamp patterns, section markers.
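A rule-based version of that structural matching might look like the sketch below. These specific patterns are illustrative guesses, not the trained model the post describes; the point is that cheap signals like header fields and timestamp formats go a long way.

```python
import re

def classify(text: str) -> str:
    """Assign a document type from cheap structural signals.
    Rules here are illustrative stand-ins for the trained classifier."""
    head = text[:500]  # structural signals cluster at the top of a document
    if re.search(r"^(From|To|Subject):", head, re.MULTILINE):
        return "email"
    if re.search(r"^(BEGIN:VEVENT|DTSTART)", head, re.MULTILINE):
        return "calendar_event"
    # Lines like "[00:14:02] Alice:" suggest a transcript
    if re.search(r"^\[\d{2}:\d{2}(:\d{2})?\]\s+\w+:", head, re.MULTILINE):
        return "meeting_transcript"
    # Markdown-style section headers suggest documentation
    if re.search(r"^#{1,3}\s", head, re.MULTILINE):
        return "documentation"
    return "web_article"  # fallback for unstructured prose
```

A lightweight model trained on features like these beats hand-written rules on edge cases, but the rules make the idea concrete.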

This isn't revolutionary ML. It's a practical pattern matcher that adds structured context to unstructured data. But the impact is immediate: Sabine can now filter by document type during retrieval, weight certain types higher for specific query patterns, and apply type-specific summarization strategies.
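The retrieval-side weighting could work along these lines. This is a hypothetical sketch: the boost table, the query-hint mechanism, and the result tuple shape are all assumptions, not Sabine's actual scoring code.

```python
# Hypothetical boost table: when a query pattern hints at a type
# (e.g. "find that email about..."), that type scores higher and
# competing types are dampened.
TYPE_BOOSTS = {
    "email": {"email": 1.5, "documentation": 0.8},
    "schedule": {"calendar_event": 1.5},
}

def rescore(results, query_hint):
    """Rescale semantic-similarity scores by document type.

    results: list of (similarity, doc_id, doc_type) tuples.
    Returns the list re-sorted by the boosted score, descending.
    """
    boosts = TYPE_BOOSTS.get(query_hint, {})
    rescored = [
        (score * boosts.get(doc_type, 1.0), doc_id, doc_type)
        for score, doc_id, doc_type in results
    ]
    return sorted(rescored, reverse=True)
```

With a boost table like this, a documentation page that narrowly out-scored an email on raw similarity drops below it once the query is recognized as email-seeking.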

What's Next

The classifier is live but basic. It handles six document types and relies on fairly obvious structural signals. Next steps:

  • Add sub-type classification for emails (action_required, fyi, thread_reply) to improve urgency scoring
  • Train on my actual document corpus instead of generic examples—Sabine's memory should reflect the kinds of documents I actually create and consume
  • Use document type to route to specialized embedding models (emails get a communication-tuned model, code docs get a technical model)
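The third step above, routing by type to specialized embedding models, reduces to a small dispatch table. The model names here are placeholders for whatever communication-tuned and technical models end up being used.

```python
# Placeholder model names; real ones would come from the embedding provider.
EMBEDDING_MODELS = {
    "email": "comm-tuned-embedder",
    "documentation": "technical-embedder",
}
DEFAULT_MODEL = "general-embedder"

def pick_embedder(doc_type: str) -> str:
    """Route a classified document to a type-appropriate embedding model,
    falling back to a general-purpose model for everything else."""
    return EMBEDDING_MODELS.get(doc_type, DEFAULT_MODEL)
```

The nice property of this design is that adding a new specialized model is a one-line table change rather than a pipeline rewrite.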

The real unlock isn't just better search. It's that Sabine can start to understand the shape of my work from the types of documents flowing through her memory. When she sees a surge of meeting transcripts and a drop in documentation, that's a signal. When emails shift from coordination to decision-making, that's a signal. The classifier is the first step toward Sabine developing situational awareness about what kind of work is happening.

That's the goal. For now, she just knows what she's reading. That alone is progress.