Skip to main content

Command Palette

Search for a command to run...

Memory Extraction Is Itself Probabilistic, And Most Systems Hide That From You

Memory systems treat extraction as a solved step. It isn't. Here's what's actually happening inside the pipeline.

Updated
8 min read
Memory Extraction Is Itself Probabilistic, And Most Systems Hide That From You

Real-World Failure

You're building a customer support agent. A user says: "I've switched from the Pro plan to the Starter plan, and honestly I'm pretty happy with it." Three things are true in that sentence. The user downgraded. The user is happy about it. And buried in the word "honestly" is a signal that they maybe expected not to be. Your memory system stores: "User is on Starter plan." It dropped the sentiment. It dropped the implication. And it made that decision silently, with no log, no score, no explanation. A week later your agent upsells them on Pro features. The user churns. Nobody on your team knows why. The memory system gave no indication anything went wrong. From its perspective, nothing did.

Why It Happens Technically

Here's the part most teams don't sit with long enough: the step where your memory system decides what to remember is itself an LLM call. That means it is probabilistic by nature. It is not a deterministic parser. It is not a rules engine. It is a language model reading a conversation and making judgment calls about salience, relevance, and importance and like all LLM calls, it can be wrong, inconsistent, and heavily influenced by how the input is framed. Consider what actually happens inside a typical extraction pipeline: The raw conversation goes in. The LLM is prompted to identify "memorable" information. It returns a set of candidates. Those candidates get checked against existing memories. Some are stored. Some are dropped. The agent moves on. The problem is step two. "Memorable" is doing enormous work there. The LLM has to simultaneously answer:

Is this new information or something already known? Is this a fact, a preference, a goal, or context? How important is this relative to everything else in the conversation? Is this something that will matter in a future session?

These are not simple questions. They require reasoning about user intent, temporal relevance, and downstream agent behavior, all in one pass. And the answers change depending on the model, the prompt, the conversation length, and what's already in memory. This is not a bug. It is the fundamental nature of using an LLM for extraction. The problem is pretending otherwise.

Why Common Approaches Fail

Most memory systems treat extraction as a solved step. They run the LLM call, take the output, and move straight to storage and retrieval. The extraction layer is a black box with an input and an output. Nothing in between is surfaced. This creates three specific failure modes in production: Silent drops. The extraction LLM decides a piece of information isn't important enough to store. It never appears in memory. No log, no score, no indication it was even considered. From the outside, it looks like the conversation never happened. Inconsistent classification. The same piece of information, "user prefers dark mode", gets classified as a preference in one run and context in another, depending on surrounding conversation. Downstream retrieval treats these differently. Your agent behaves inconsistently across sessions with no traceable cause. Importance score drift. Most systems that do assign importance scores don't expose them. A memory with a score of 0.51 gets stored. One with 0.49 gets dropped. That 0.02 difference is arbitrary at that precision level, and you have no way to audit it, adjust it, or even know it happened. The common response to these failures is to tune the extraction prompt. But prompt tuning without visibility into what the extraction step is actually doing is guesswork. You're adjusting inputs without being able to observe outputs at the candidate level.

Mental Model: The Extraction Step Is a Reasoning Layer, Not a Parser

Here's the reframe that changes how you think about memory system design. Stop thinking of extraction as a parsing step, something that pulls structured data from unstructured text. Start thinking of it as a reasoning layer. One that makes judgment calls under uncertainty, like every other LLM call in your stack. Once you accept that framing, the implications follow naturally: It needs to be observable. Every other reasoning layer in your system is observable. You log LLM calls. You trace tool use. You inspect chain-of-thought. The extraction step deserves the same treatment, not because it will always be wrong, but because you need to know when it is. It needs confidence signals. A binary store/don't-store decision throws away information. A good extraction layer should produce a scored candidate for every piece of information it considers, including the ones it rejects. The score is the signal. A candidate rejected at 0.3 tells you something different than one rejected at 0.48. It needs to expose its reasoning. The LLM doing the extraction has a reason for every decision it makes. That reasoning exists in the completion. Most systems discard it. Keeping it and making it queryable is the difference between a memory system you can debug and one you can only observe from the outside. Think of it like this: if your inference layer hallucinated and you had no logs, no traces, no token probabilities, just the final output, you'd consider that unacceptable for production. The extraction layer in most memory systems is exactly that. Final output only. No trace. No reasoning. No scores.

Production Implications

The practical consequence of opaque extraction shows up in a specific way in production: you cannot distinguish between "the user never said that" and "the system decided not to store it." Both look identical from the agent's perspective. Both look identical in your logs. The only difference is what actually happened and without extraction-level visibility, you have no way to know. This matters most in three scenarios: Debugging agent behavior. When your agent acts on wrong or missing information, the first question is always: was this ever stored? Without candidate-level logs, you can't answer that. You're left replaying conversations manually and guessing. Regulated environments. Healthcare, legal, financial etc, anywhere you need to demonstrate that your agent's memory is accurate and auditable. "We store what the LLM decides to store" is not an answer that survives compliance review. Multi-session consistency. Users expect agents to remember things across sessions. When they don't, users notice immediately. Diagnosing why requires knowing what the extraction layer saw, scored, and decided. Not just what ended up in the store.

Open Problems and Tradeoffs

It's worth being honest about what's still unsolved here. Extraction quality is model-dependent. A stronger model makes better extraction decisions. But stronger models cost more and are slower. There's a real tradeoff between extraction fidelity and pipeline latency that every team has to navigate. There's no universally right answer. Importance scoring is still noisy at the margins. The difference between a score of 0.45 and 0.55 is meaningful in aggregate but arbitrary for any single candidate. Threshold tuning helps but doesn't eliminate the noise. This is an area where the field hasn't converged on a good solution yet. Extraction instructions are the primary tuning lever and they're fragile. The prompt that drives extraction is doing a lot of work. Small changes can have large downstream effects on what gets remembered. Teams that invest in this tuning see better results, but it requires iteration and visibility into what's changing.

Practical Recommendations

If you're building or evaluating a memory system, here's what to actually look for: Demand candidate-level logging. Every piece of information the extraction layer considered should be logged and not just what was stored. If your memory system can't show you what it rejected and why, you're flying blind. Treat importance scores as first-class data. Don't let the system make binary decisions and hide the scores. Store them. Query them. Use them to tune thresholds over time with real data rather than intuition. Separate extraction from storage in your mental model. They're different problems with different failure modes. Extraction is a reasoning problem. Storage is an engineering problem. Conflating them makes both harder to debug. Build for auditability from day one. Retrofitting observability into a memory system is painful. If you're building your own, instrument the extraction layer before you instrument anything else. If you're evaluating third-party systems, ask to see a pipeline trace before you ask about latency numbers.

How Memgram Approaches This

This problem is specifically why Memgram was built with a transparent extraction pipeline. Every POST /memory/add call produces a PipelineTrace. For each candidate the extraction LLM considers, Memgram stores the content, memory type, importance score (0–1), dedup decision, final decision, rejection reason, and the full LLM reasoning text. Candidates that were dropped are logged alongside candidates that were stored. You can see exactly what the system saw, what it scored, and why it made the decision it made. That trace is queryable via GET /trace/:id. In the dashboard, you can click into any memory write, see every candidate, and drill into the LLM reasoning for each one. It doesn't eliminate the probabilistic nature of extraction. Nothing does. But it makes that uncertainty visible — which is the precondition for doing anything about it.