One of the most requested features from early Sabine users has been simple: "Can she learn from my corrections?" Today, we're excited to share that the foundation for that capability just shipped.
The Problem: Feedback Without Memory
Large language models are incredibly capable, but they don't inherently improve from usage. When you thumbs-down a response or correct Sabine's answer, that signal historically disappeared into the void. The model serving your next request has no idea what worked or what didn't.
We knew we needed a way to capture these preference signals—not just for analytics, but as structured training data that could eventually fine-tune Sabine's behavior.
What We Shipped
This release includes three key components:
- Database schema: A new preference_signals table in Supabase that stores user feedback with full context—session ID, message content, signal type (thumbs up, thumbs down, correction, re-ask), and metadata.
- Chat UI component: A ThumbsFeedback React component that appears on hover for every assistant message, letting users vote or flag responses instantly.
- DPO export script: A Python script (export_dpo_pairs.py) that reads preference signals, pairs positive and negative examples for the same prompt, and outputs JSONL files ready for Direct Preference Optimization training.
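To make the export format concrete, here is a minimal sketch of what one line of that JSONL output could look like. The field names (`prompt`, `chosen`, `rejected`) follow the common DPO convention, but the script's actual output schema is an assumption here:

```python
import json

# Hedged sketch of one exported DPO training record. The three-field
# shape is the convention most DPO trainers expect; the real
# export_dpo_pairs.py output may carry extra metadata.
record = {
    "prompt": "How do I reset my API key?",
    "chosen": "Go to Settings > API Keys, revoke the old key, then generate a new one.",
    "rejected": "You can't reset API keys.",
}

# JSONL is simply one JSON object per line.
line = json.dumps(record)
print(line)
```

Each exported file is then just many of these lines, one preference pair per line.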
Direct Preference Optimization (DPO) is a fine-tuning technique that teaches a model from pairs of responses to the same prompt, one preferred and one rejected. Unlike traditional RLHF, it needs no separate reward model and no reinforcement learning loop, which makes it simpler and more stable to run; it has been used to align a number of open-weight chat models.
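Under the hood, DPO's objective is just a logistic loss on log-probability ratios between the current policy and a frozen reference model. A toy sketch, not Sabine's training code:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin))).

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model; beta controls how far the
    policy may drift from the reference.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy favors the chosen response more than the reference
# does, the loss is small; when it favors the rejected one, it grows.
low = dpo_loss(-5.0, -20.0, -10.0, -10.0)   # policy prefers chosen
high = dpo_loss(-20.0, -5.0, -10.0, -10.0)  # policy prefers rejected
```

The key design point is that the reference model anchors the update: the policy is only rewarded for preferring the chosen response *more than the reference already does*, which prevents it from collapsing onto the preference data.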
How It Works
When you thumbs-down a response, the system records the original answer as a "rejected" example. If you later thumbs-up a different response to a similar question, the export script pairs them automatically. For corrections, where you provide a better answer, the script uses your correction as the "chosen" response and the original as "rejected."
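The pairing step described above can be sketched roughly as follows. The signal shape (dicts with `prompt`, `response`, `type`, and `correction` keys) is an assumption for illustration; the real export_dpo_pairs.py logic is more involved:

```python
from collections import defaultdict

def pair_signals(signals):
    """Group raw feedback signals by prompt and emit chosen/rejected pairs.

    `signals` is a list of dicts with assumed keys: "prompt", "response",
    "type" (one of "thumbs_up", "thumbs_down", "correction"), and, for
    corrections, a "correction" field holding the user's better answer.
    """
    by_prompt = defaultdict(lambda: {"chosen": [], "rejected": []})
    pairs = []
    for s in signals:
        if s["type"] == "correction":
            # A correction is a complete pair on its own: the user's text
            # is chosen, the original answer is rejected.
            pairs.append({"prompt": s["prompt"],
                          "chosen": s["correction"],
                          "rejected": s["response"]})
        elif s["type"] == "thumbs_up":
            by_prompt[s["prompt"]]["chosen"].append(s["response"])
        elif s["type"] == "thumbs_down":
            by_prompt[s["prompt"]]["rejected"].append(s["response"])
    # Combine thumbs-up and thumbs-down responses to the same prompt.
    for prompt, bucket in by_prompt.items():
        for chosen in bucket["chosen"]:
            for rejected in bucket["rejected"]:
                pairs.append({"prompt": prompt,
                              "chosen": chosen,
                              "rejected": rejected})
    return pairs
```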
The export process includes deduplication, quality filtering (minimum prompt length, non-empty responses, chosen ≠ rejected), and smart pairing logic that combines incomplete signals from the same session.
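That quality gate might look something like this; the thresholds and the dedup key scheme are illustrative assumptions, not the script's actual values:

```python
def keep_pair(pair, min_prompt_len=10, seen=None):
    """Apply the filters described above: minimum prompt length,
    non-empty responses, chosen != rejected, and deduplication via an
    optional `seen` set threaded through successive calls."""
    if len(pair["prompt"].strip()) < min_prompt_len:
        return False
    if not pair["chosen"].strip() or not pair["rejected"].strip():
        return False
    if pair["chosen"] == pair["rejected"]:
        return False
    if seen is not None:
        key = (pair["prompt"], pair["chosen"], pair["rejected"])
        if key in seen:
            return False  # exact duplicate of an earlier pair
        seen.add(key)
    return True
```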
What This Means for Users
Right now, this is infrastructure. Your feedback is being captured and stored, but Sabine isn't actively learning from it yet. Think of this as Phase A of a multi-phase reinforcement learning system.
The immediate benefit: your corrections and preferences are no longer lost. Every thumbs-down, every re-ask, every correction is logged. As we accumulate signals, we'll have a high-quality dataset that reflects real usage patterns—not synthetic benchmarks or crowd-sourced annotations.
What's Next
Phase B will focus on active learning triggers—automatically identifying when Sabine is uncertain and proactively asking for feedback before responding. Phase C will close the loop: running DPO fine-tuning on accumulated preference data and deploying improved models.
We're setting a target of 500 preference pairs before running the first fine-tuning experiment. Based on early usage, we expect to hit that threshold within 2-3 weeks of production traffic.
In the meantime, if you're using Sabine and see a response that's off, don't just skip it—thumbs-down it. If you know the right answer, correct it. Those signals are now part of Sabine's permanent learning record.
This is what autonomous product development looks like: shipping the instrumentation before the feature, building feedback loops before scaling, and treating user corrections as first-class training data.