memgram

The Moment Memory Starts Guessing, It Stops Being Memory

Vibhanshu Karn — Fri, 29 May 2026 13:36:16 GMT

Real-World Failure

Your agent handles customer onboarding. A user once mentioned that they were evaluating your product for a healthcare team. They never said they completed the evaluation. They never said they bought. They never said they're still in healthcare.

Your memory system stored: "User works in healthcare."

Six months later the user asks a billing question. Your agent responds with HIPAA-compliant data handling assurances — unsolicited, confident, specific.

The user is now at a fintech startup. They switched jobs two months ago. They find the response confusing, slightly alarming, and they lose trust in the agent.

Nobody touched the memory. The system didn't malfunction. It remembered correctly — and then reasoned beyond what it remembered. That's the failure.

Why It Happens Technically

The root cause is architectural, not a bug in any single component.

When an agent retrieves memories and passes them to an LLM for response generation, the LLM doesn't treat memories as a bounded dataset. It treats them as context — and LLMs are trained to be maximally helpful with context. They fill gaps. They infer. They extrapolate. That's what they're optimized to do.

So when the memory layer returns "User works in healthcare" and the LLM encounters a billing question, it doesn't ask "is this fact still current?" It asks "how can I use this fact to give the best answer?" Those are fundamentally different questions.

The problem compounds because memory retrieval is similarity-based. A billing question retrieves the healthcare fact because they're contextually related — compliance, regulation, data handling. The retrieval is working correctly. The LLM's use of it is working correctly. The failure lives in the space between them: there is no contract about what the LLM is allowed to do with a retrieved memory.

In database terms: you'd never let application code mutate a value it only has read access to. But when an LLM reasons over a retrieved memory and presents an inference as if it were a stored fact, that's exactly what's happening — the LLM is writing to the user's mental model using data it was only supposed to read.

There's no type system for memory. No constraint that says: this is a stored fact, not an inference license.

Why Common Approaches Fail

The typical response to this problem is prompt engineering. Teams add instructions like: "Only use retrieved memories as context, do not infer beyond them." This helps at the margins. It does not solve the problem.

Three specific failure patterns that persist even with careful prompting:

The confident inference pattern. The LLM retrieves a preference — "user prefers concise responses" — and infers that the user is an expert who doesn't need explanation. The memory stored a style preference. The LLM promoted it to a capability assumption. The instruction said "don't infer beyond memories" but the LLM's inference happened implicitly, inside the response generation, not as a visible reasoning step.

The stale fact problem. Memory systems store what was true at extraction time. They don't store expiry signals. A memory that was accurate six months ago is retrieved with the same confidence score as one stored yesterday. The LLM has no way to distinguish between them. Prompting it to "be cautious about old memories" requires it to know which memories are old — which it doesn't, because that metadata isn't surfaced.

The aggregation problem. Individual memories are accurate. The LLM aggregates them into a composite picture that was never stored. "User is in healthcare" + "user is cost-sensitive" + "user asked about compliance" becomes "This is a regulated enterprise buyer." Each memory is real. The composite is an inference. The agent acts on the composite.

Prompt instructions describe the output you want. The failure happens during generation, before any output exists to check against them.

Mental Model: Memory Is a Read-Only Ledger, Not a Reasoning Surface

Here's the reframe.

Memory and reasoning are two different systems that share an interface, and the failure happens when that interface is undefined.

Think of memory as a read-only ledger. It contains entries. Each entry has a timestamp, a confidence score, and a type. The ledger doesn't interpret entries. It doesn't fill gaps between entries. It doesn't project forward from entries. It returns exactly what was written, with full provenance.

Reasoning is a separate process that operates over ledger entries. It can infer. It can aggregate. It can project. But it should be explicitly labeled as doing so — and those inferences should never be fed back into the ledger as if they were stored facts.

The problem in most agent architectures is that this boundary is implicit. The memory system returns facts. The LLM generates a response. The user sees a confident statement. There's no visible handoff between "this is what was stored" and "this is what the LLM inferred from what was stored." Here's the practical version of this model:

Memory layer
    returns: fact, type, timestamp, confidence
    contract: "this was observed"

Reasoning layer
    operates on: memory outputs
    contract: "this is inferred"
    constraint: inferences are labeled, not persisted as facts
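One way to make the two contracts concrete is to give each layer its own type, so that an observation and an inference can never be confused in code or in logs. The sketch below is a minimal illustration, not a description of any particular product; `Observation`, `Inference`, and `label` are hypothetical names, and the field values are made up.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Observation:
    """A ledger entry. Contract: 'this was observed'."""
    content: str
    memory_type: str
    observed_at: datetime
    confidence: float


@dataclass(frozen=True)
class Inference:
    """A reasoning output. Contract: 'this is inferred'."""
    content: str
    derived_from: tuple[Observation, ...]


def label(item: Observation | Inference) -> str:
    """Render an item so a reader or a log can tell stored from inferred."""
    if isinstance(item, Observation):
        return f"[observed {item.observed_at:%Y-%m-%d}] {item.content}"
    return f"[inferred from {len(item.derived_from)} observation(s)] {item.content}"


# Illustrative values only.
seen = Observation(
    content="User is evaluating the product for a healthcare team",
    memory_type="observed_fact",
    observed_at=datetime(2025, 11, 29, tzinfo=timezone.utc),
    confidence=0.9,
)
guess = Inference(content="User needs HIPAA assurances", derived_from=(seen,))

for item in (seen, guess):
    print(label(item))
```

Both classes are frozen, so neither can be edited after creation, and an inference always carries the observations it was derived from.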

Once you draw this line, two things become clear.

First, retrieval and generation need different trust levels. A retrieved memory is an observation. A generated response is an interpretation. The agent's output should distinguish between them ("Based on what I know about you" vs "I'm inferring that") rather than blend them into a single confident statement.

Second, the memory system's job is to be a perfect witness, not a helpful advisor. It should return exactly what happened, with exactly the confidence it deserves, and get out of the way. The moment the memory layer starts smoothing, interpolating, or filling gaps — it has crossed from memory into reasoning, and you've lost the boundary entirely.

Production Implications

This boundary failure shows up in three specific ways in production:

Trust erosion over time. Users don't notice one wrong inference. They notice a pattern of the agent seeming to know things it shouldn't, or being confidently wrong about things that changed. That pattern — not any single failure — is what kills user trust in AI agents. By the time the pattern is visible, dozens of bad inferences have already been presented as facts.

Debugging becomes impossible. When a user reports that the agent said something wrong, the first question is: was this stored, or was this inferred? Without a hard boundary and explicit labeling, you cannot answer that question from logs alone. You have to replay the conversation and reconstruct what the LLM likely inferred — which is guesswork.

RAG systems have this problem at scale. Retrieval-augmented generation is the standard architecture. But RAG doesn't enforce the memory/reasoning boundary — it retrieves documents and lets the LLM reason freely over them. In a customer-facing agent where retrieved context includes user history, that free reasoning is a liability. Every response is a blend of retrieval and inference with no label distinguishing them.

Open Problems and Tradeoffs

The memory/reasoning boundary is easy to define conceptually. It's genuinely hard to enforce in practice.

Inference is useful. An agent that only returns stored facts without inference is less helpful. Users want the agent to connect dots. The question isn't whether inference should happen — it's whether it should be labeled and whether it should be persisted. Deciding exactly where helpfulness ends and hallucination risk begins is not a solved problem.

Temporal validity is unsolved at scale. Knowing that a memory is stale requires either an expiry signal at write time (which requires predicting at extraction time how long a fact will be valid — often impossible) or a freshness check at retrieval time (which requires knowing the current state of the world to compare against). Neither is clean.

Composite inferences are invisible. The aggregation problem described above is the hardest to solve. Individual memories are correct. The composite the LLM builds from them is an emergent inference that was never explicitly generated, logged, or auditable. There's no clean architectural answer to this yet — it requires either constraining the LLM's reasoning scope (expensive, imperfect) or logging intermediate reasoning steps (adds latency, complexity).

Practical Recommendations

Explicitly type your memories at write time. Every stored memory should carry a type that signals its certainty and expected lifespan — observed_fact, stated_preference, inferred_context, temporal_fact. The type should gate how the LLM is permitted to use it at retrieval time.
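As a rough sketch of what "the type gates usage" could look like, the snippet below attaches a usage rule to each of the four types named above. The type names come from the text; the `USAGE_POLICY` table, its wording, and `render_for_prompt` are assumptions made for illustration.

```python
from enum import Enum


class MemoryType(Enum):
    OBSERVED_FACT = "observed_fact"
    STATED_PREFERENCE = "stated_preference"
    INFERRED_CONTEXT = "inferred_context"
    TEMPORAL_FACT = "temporal_fact"


# Hypothetical policy table: what the response layer may do with each type.
USAGE_POLICY = {
    MemoryType.OBSERVED_FACT: "May be stated as known. Do not extend it to related facts.",
    MemoryType.STATED_PREFERENCE: "Apply to style only. Do not infer skill or intent from it.",
    MemoryType.INFERRED_CONTEXT: "Treat as a guess. Confirm with the user before acting on it.",
    MemoryType.TEMPORAL_FACT: "May have expired. Check its age before relying on it.",
}


def render_for_prompt(content: str, memory_type: MemoryType) -> str:
    """Attach the type and its usage rule to a memory before it reaches the LLM."""
    return f"({memory_type.value}) {content}\n  rule: {USAGE_POLICY[memory_type]}"


print(render_for_prompt("User prefers concise responses", MemoryType.STATED_PREFERENCE))
```

This version still relies on the model honoring the rule text. A stricter variant would filter out disallowed types before the prompt is built.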

Surface timestamps to the LLM. Don't just retrieve the memory content — retrieve it with its age. "User works in healthcare (stored 6 months ago)" is a meaningfully different input than "User works in healthcare." Let the LLM reason about staleness rather than treating all memories as equally current.
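A minimal sketch of that rendering step, assuming each memory carries a timezone-aware `stored_at` timestamp. The function names and the day and month cutoffs are arbitrary choices for illustration.

```python
from datetime import datetime, timezone


def describe_age(stored_at: datetime, now: datetime) -> str:
    """Turn a timestamp into a coarse, human-readable age."""
    days = (now - stored_at).days
    if days < 1:
        return "stored today"
    if days < 60:
        return f"stored {days} day(s) ago"
    # Months are approximated as 30 days; precision is not the point here.
    return f"stored about {days // 30} month(s) ago"


def with_age(content: str, stored_at: datetime, now: datetime) -> str:
    """Append the age to the memory text that is placed in the prompt."""
    return f"{content} ({describe_age(stored_at, now)})"


now = datetime(2026, 5, 29, tzinfo=timezone.utc)
stored = datetime(2025, 11, 29, tzinfo=timezone.utc)
print(with_age("User works in healthcare", stored, now))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the rendering reproducible when you replay a trace later.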

Label inferences in responses. Build response generation prompts that explicitly distinguish between stored facts and inferences. "Based on stored information: X. Based on inference from that: Y." Users deserve to know the difference. More importantly, your logs need to capture it.

Never persist LLM inferences back into the memory store. This is the most critical rule. If the LLM infers something during response generation, that inference should not be written back as a memory. Only observations — things explicitly stated or demonstrated — belong in the ledger.
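One way to enforce this rule is a guard at the write path. The sketch below assumes every write is tagged with where it came from; the `source` values, `Ledger`, and `InferenceWriteError` are hypothetical names, and the guard is only as reliable as the tagging at the call site.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal

Source = Literal["user_statement", "user_action", "llm_inference"]


@dataclass(frozen=True)
class MemoryWrite:
    content: str
    source: Source


class InferenceWriteError(ValueError):
    """Raised when something tries to persist an inference as a memory."""


@dataclass
class Ledger:
    entries: list[MemoryWrite] = field(default_factory=list)

    def append(self, write: MemoryWrite) -> None:
        # Only observations belong in the ledger; inferences are refused.
        if write.source == "llm_inference":
            raise InferenceWriteError(f"refusing to persist inference: {write.content!r}")
        self.entries.append(write)


ledger = Ledger()
ledger.append(MemoryWrite("User is on the Starter plan", "user_statement"))
try:
    ledger.append(MemoryWrite("User is a regulated enterprise buyer", "llm_inference"))
except InferenceWriteError as err:
    print(err)
```

Failing loudly is deliberate: a silently dropped write would hide the fact that some code path tried to cross the boundary.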

Audit your retrieval-to-response path. Build tooling that lets you trace: what memories were retrieved, what the LLM was given, what it returned. That trace is the only way to catch boundary violations in production before they accumulate into trust failures.
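A trace like that can start as one structured record per response. This is a sketch under simple assumptions: retrieved memories are dicts with an `id` key, and `sink` is any list-like store of JSON lines. `ResponseTrace` and `record_trace` are illustrative names.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ResponseTrace:
    query: str
    retrieved_ids: list[str]  # which memories were retrieved
    prompt: str               # what the LLM was given
    response: str             # what it returned
    created_at: str


def record_trace(
    query: str,
    retrieved: list[dict],
    prompt: str,
    response: str,
    sink: list[str],
) -> ResponseTrace:
    """Build one trace record and append it to the sink as a JSON line."""
    trace = ResponseTrace(
        query=query,
        retrieved_ids=[m["id"] for m in retrieved],
        prompt=prompt,
        response=response,
        created_at=datetime.now(timezone.utc).isoformat(),
    )
    sink.append(json.dumps(asdict(trace)))
    return trace


sink: list[str] = []
record_trace(
    query="Why was I charged twice?",
    retrieved=[{"id": "m1", "content": "User works in healthcare"}],
    prompt="<prompt text sent to the model>",
    response="<model output>",
    sink=sink,
)
```

With the memory IDs and the response in the same record, "was this stored or inferred?" becomes a lookup instead of a reconstruction.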

How Memgram Approaches This

The memory/reasoning boundary is why Memgram stores explicit memory types — fact, preference, goal, skill, event, context, constraint — for every stored candidate. Each memory carries its type, importance score, and timestamp through the full pipeline. When you query via POST /memory/search, you get typed results with provenance, not a blended context blob.

The PipelineTrace captures what was extracted and why. When your pipeline produces an unexpected output, you can audit exactly what was extracted, what was scored, and what was persisted, so the write side is fully visible. The retrieval side logs which memory IDs were returned for a given query via the search logs, though connecting that retrieval to a specific LLM response is still a manual step. The boundary is visible, which is the precondition for enforcing it.

Memory Extraction Is Itself Probabilistic, And Most Systems Hide That From You

Vibhanshu Karn — Fri, 29 May 2026 11:23:08 GMT

Real-World Failure

You're building a customer support agent. A user says: "I've switched from the Pro plan to the Starter plan, and honestly I'm pretty happy with it."

Three things are true in that sentence. The user downgraded. The user is happy about it. And buried in the word "honestly" is a signal that they may have expected not to be.

Your memory system stores: "User is on Starter plan."

It dropped the sentiment. It dropped the implication. And it made that decision silently, with no log, no score, no explanation.

A week later your agent upsells them on Pro features. The user churns. Nobody on your team knows why.

The memory system gave no indication anything went wrong. From its perspective, nothing did.

Why It Happens Technically

Here's the part most teams don't sit with long enough: the step where your memory system decides what to remember is itself an LLM call.

That means it is probabilistic by nature. It is not a deterministic parser. It is not a rules engine. It is a language model reading a conversation and making judgment calls about salience, relevance, and importance, and like all LLM calls, it can be wrong, inconsistent, and heavily influenced by how the input is framed.

Consider what actually happens inside a typical extraction pipeline:

1. The raw conversation goes in.
2. The LLM is prompted to identify "memorable" information.
3. It returns a set of candidates.
4. Those candidates get checked against existing memories.
5. Some are stored. Some are dropped.
6. The agent moves on.

The problem is step two. "Memorable" is doing enormous work there. The LLM has to simultaneously answer:

Is this new information or something already known?
Is this a fact, a preference, a goal, or context?
How important is this relative to everything else in the conversation?
Is this something that will matter in a future session?

These are not simple questions. They require reasoning about user intent, temporal relevance, and downstream agent behavior, all in one pass. And the answers change depending on the model, the prompt, the conversation length, and what's already in memory. This is not a bug. It is the fundamental nature of using an LLM for extraction. The problem is pretending otherwise.

Why Common Approaches Fail

Most memory systems treat extraction as a solved step. They run the LLM call, take the output, and move straight to storage and retrieval. The extraction layer is a black box with an input and an output. Nothing in between is surfaced.

This creates three specific failure modes in production:

Silent drops. The extraction LLM decides a piece of information isn't important enough to store. It never appears in memory. No log, no score, no indication it was even considered. From the outside, it looks like the conversation never happened.

Inconsistent classification. The same piece of information, "user prefers dark mode", gets classified as a preference in one run and context in another, depending on surrounding conversation. Downstream retrieval treats these differently. Your agent behaves inconsistently across sessions with no traceable cause.

Importance score drift. Most systems that do assign importance scores don't expose them. A memory with a score of 0.51 gets stored. One with 0.49 gets dropped. That 0.02 difference is arbitrary at that precision level, and you have no way to audit it, adjust it, or even know it happened.

The common response to these failures is to tune the extraction prompt. But prompt tuning without visibility into what the extraction step is actually doing is guesswork. You're adjusting inputs without being able to observe outputs at the candidate level.

Mental Model: The Extraction Step Is a Reasoning Layer, Not a Parser

Here's the reframe that changes how you think about memory system design.

Stop thinking of extraction as a parsing step, something that pulls structured data from unstructured text. Start thinking of it as a reasoning layer. One that makes judgment calls under uncertainty, like every other LLM call in your stack.

Once you accept that framing, the implications follow naturally:

It needs to be observable. Every other reasoning layer in your system is observable. You log LLM calls. You trace tool use. You inspect chain-of-thought. The extraction step deserves the same treatment, not because it will always be wrong, but because you need to know when it is.

It needs confidence signals. A binary store/don't-store decision throws away information. A good extraction layer should produce a scored candidate for every piece of information it considers, including the ones it rejects. The score is the signal. A candidate rejected at 0.3 tells you something different than one rejected at 0.48.

It needs to expose its reasoning. The LLM doing the extraction has a reason for every decision it makes. That reasoning exists in the completion. Most systems discard it. Keeping it and making it queryable is the difference between a memory system you can debug and one you can only observe from the outside.

Think of it like this: if your inference layer hallucinated and you had no logs, no traces, no token probabilities, just the final output, you'd consider that unacceptable for production. The extraction layer in most memory systems is exactly that. Final output only. No trace. No reasoning. No scores.
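To make "a scored candidate for every piece of information, including the ones it rejects" concrete, here is a small sketch of a decision step that keeps every candidate, its score, and its reasoning. The `Candidate` and `Decision` shapes, the 0.5 threshold, and the example scores are assumptions for illustration; in a real pipeline the scores and reasoning text would come from the extraction LLM.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    content: str
    memory_type: str
    importance: float  # 0.0 to 1.0, as returned by the extraction step
    reasoning: str     # the model's stated reason, kept instead of discarded


@dataclass(frozen=True)
class Decision:
    candidate: Candidate
    stored: bool
    reason: str


def decide(candidates: list[Candidate], threshold: float = 0.5) -> list[Decision]:
    """Return a decision for every candidate, not only the ones that are stored."""
    decisions = []
    for c in candidates:
        stored = c.importance >= threshold
        reason = (
            "at or above threshold"
            if stored
            else f"importance {c.importance:.2f} below threshold {threshold:.2f}"
        )
        decisions.append(Decision(candidate=c, stored=stored, reason=reason))
    return decisions


# Made-up candidates and scores for a plan-downgrade message.
candidates = [
    Candidate("User is on Starter plan", "fact", 0.90, "explicit statement"),
    Candidate("User is happy with the downgrade", "preference", 0.48, "mild sentiment"),
    Candidate("User may have expected to be unhappy", "context", 0.30, "implied by 'honestly'"),
]
decisions = decide(candidates)
for d in decisions:
    print(d.stored, d.candidate.content, "|", d.reason)
```

Only one candidate is stored here, but all three end up in the decision log, so the dropped sentiment is visible after the fact.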

Production Implications

The practical consequence of opaque extraction shows up in a specific way in production: you cannot distinguish between "the user never said that" and "the system decided not to store it." Both look identical from the agent's perspective. Both look identical in your logs. The only difference is what actually happened, and without extraction-level visibility, you have no way to know.

This matters most in three scenarios:

Debugging agent behavior. When your agent acts on wrong or missing information, the first question is always: was this ever stored? Without candidate-level logs, you can't answer that. You're left replaying conversations manually and guessing.

Regulated environments. Healthcare, legal, financial, and anywhere else you need to demonstrate that your agent's memory is accurate and auditable. "We store what the LLM decides to store" is not an answer that survives compliance review.

Multi-session consistency. Users expect agents to remember things across sessions. When they don't, users notice immediately. Diagnosing why requires knowing what the extraction layer saw, scored, and decided, not just what ended up in the store.
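With candidate-level logs, the "was this ever stored?" question turns into a three-way lookup. The sketch below assumes a log of dicts with `content` and `stored` keys; the log shape and the `lookup` helper are hypothetical, and the substring match is a stand-in for whatever matching a real system would use.

```python
from __future__ import annotations

from typing import Literal

Status = Literal["stored", "rejected", "never_seen"]


def lookup(log: list[dict], needle: str) -> Status:
    """Classify a piece of information using candidate-level logs."""
    matches = [row for row in log if needle.lower() in row["content"].lower()]
    if not matches:
        # No candidate was ever produced for this information.
        return "never_seen"
    if any(row["stored"] for row in matches):
        return "stored"
    # A candidate existed, and the pipeline decided not to keep it.
    return "rejected"


# Made-up log rows for illustration.
candidate_log = [
    {"content": "User is on Starter plan", "stored": True},
    {"content": "User is happy with the downgrade", "stored": False},
]

print(lookup(candidate_log, "starter plan"))
print(lookup(candidate_log, "happy with the downgrade"))
print(lookup(candidate_log, "dark mode"))
```

Without the rejected rows in the log, the second and third lookups would be indistinguishable.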

Open Problems and Tradeoffs

It's worth being honest about what's still unsolved here.

Extraction quality is model-dependent. A stronger model makes better extraction decisions. But stronger models cost more and are slower. There's a real tradeoff between extraction fidelity and pipeline latency that every team has to navigate. There's no universally right answer.

Importance scoring is still noisy at the margins. The difference between a score of 0.45 and 0.55 is meaningful in aggregate but arbitrary for any single candidate. Threshold tuning helps but doesn't eliminate the noise. This is an area where the field hasn't converged on a good solution yet.

Extraction instructions are the primary tuning lever and they're fragile. The prompt that drives extraction is doing a lot of work. Small changes can have large downstream effects on what gets remembered. Teams that invest in this tuning see better results, but it requires iteration and visibility into what's changing.

Practical Recommendations

If you're building or evaluating a memory system, here's what to actually look for:

Demand candidate-level logging. Every piece of information the extraction layer considered should be logged, not just what was stored. If your memory system can't show you what it rejected and why, you're flying blind.

Treat importance scores as first-class data. Don't let the system make binary decisions and hide the scores. Store them. Query them. Use them to tune thresholds over time with real data rather than intuition.

Separate extraction from storage in your mental model. They're different problems with different failure modes. Extraction is a reasoning problem. Storage is an engineering problem. Conflating them makes both harder to debug.

Build for auditability from day one. Retrofitting observability into a memory system is painful. If you're building your own, instrument the extraction layer before you instrument anything else. If you're evaluating third-party systems, ask to see a pipeline trace before you ask about latency numbers.
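Once scores are stored, threshold tuning can be driven by the logged data. The sketch below shows two simple queries over a list of logged importance scores: how many candidates each threshold would keep, and which candidates sit close enough to the cutoff that the decision is effectively noise. The function names, the 0.05 band, and the sample scores are assumptions for illustration.

```python
from __future__ import annotations


def sweep(scores: list[float], thresholds: list[float]) -> dict[float, int]:
    """For each threshold, count how many logged candidates would be stored."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


def near_threshold(scores: list[float], threshold: float, band: float = 0.05) -> list[float]:
    """Return scores within `band` of the threshold, where decisions are least stable."""
    return [s for s in scores if abs(s - threshold) <= band]


# Made-up scores standing in for a real candidate log.
logged_scores = [0.90, 0.51, 0.49, 0.48, 0.30, 0.12]

print(sweep(logged_scores, [0.4, 0.5, 0.6]))
print(near_threshold(logged_scores, 0.5))
```

If half of your candidates fall inside the band, moving the threshold will not fix much; that is a sign the scoring prompt, not the cutoff, needs attention.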

How Memgram Approaches This

This problem is specifically why Memgram was built with a transparent extraction pipeline.

Every POST /memory/add call produces a PipelineTrace. For each candidate the extraction LLM considers, Memgram stores the content, memory type, importance score (0–1), dedup decision, final decision, rejection reason, and the full LLM reasoning text. Candidates that were dropped are logged alongside candidates that were stored. You can see exactly what the system saw, what it scored, and why it made the decision it made.

That trace is queryable via GET /trace/:id. In the dashboard, you can click into any memory write, see every candidate, and drill into the LLM reasoning for each one.

It doesn't eliminate the probabilistic nature of extraction. Nothing does. But it makes that uncertainty visible, which is the precondition for doing anything about it.