Every Memory System Works in Week One

Real-World Failure

Your team ships a customer success agent in Q1. The demo is clean. Early users love it. The agent remembers context, references past conversations, feels genuinely personalized.

By Q3, support tickets start coming in. The agent is referencing products the company sunset in April. It's addressing users by job titles they no longer hold. It's surfacing pricing from a plan structure that was reorganized in June.

Nobody changed the memory system. Nobody touched the extraction pipeline. The agent is retrieving correctly — returning the highest-scoring memories for each query. The memories are just wrong. They were accurate when they were written.

No error was thrown. No alert fired. The system degraded silently, one stale memory at a time, over six months. The only signal was user complaints.

Why It Happens Technically

Memory systems are optimized for write correctness and retrieval relevance. Neither of these properties has anything to do with whether a memory is still valid.

When a memory is written, it gets an importance score, a type classification, and a timestamp. That's it. There's no validity window. There's no expiry condition. There's no mechanism that asks — at retrieval time — whether the world has changed since this memory was created.

The retrieval path makes this worse. Semantic similarity search surfaces memories that are topically relevant to the current query. A memory about a user's job title is highly relevant to a query about their work context — regardless of whether that job title is six days old or six months old. Relevance and validity are orthogonal properties. Optimizing for one does nothing for the other.

Here's the deeper problem: staleness is invisible at the memory level. A stale memory looks identical to a fresh one. Same structure, same fields, same retrieval score. The only thing that differs is the relationship between the stored value and the current state of the world — and the memory system has no model of the current state of the world. It only knows what was true when the memory was written.

This is why degradation is silent. The system isn't failing. It's doing exactly what it was designed to do. It's retrieving the most relevant memories. It has no way to know those memories stopped being true.

Why Common Approaches Fail

The recency weighting pattern. Teams add time decay to retrieval scoring — memories written recently score higher than older ones. This suppresses stale facts organically over time. The failure: it also suppresses facts that are stable and permanent. A user's core constraint — "I never invest in crypto" — set once three years ago and never mentioned again, scores lower than a transient observation from last week. Recency weighting conflates silence with staleness. They're different signals.

The re-extraction pattern. Some implementations periodically re-run extraction over recent conversations, letting new extractions overwrite old ones by similarity threshold. This catches explicit updates — "I moved to Berlin" overwrites "I live in Austin." It misses implicit invalidation. Nobody says "that product I mentioned is discontinued." They just stop talking about it. Re-extraction without external invalidation signals misses the entire class of world-change staleness.

The user feedback pattern. Teams surface a "was this helpful?" signal and use negative feedback to flag stale memories. This catches staleness after it causes a visible failure. It doesn't catch the hundreds of times a stale memory informed agent behavior without the user noticing or bothering to report it. Feedback-driven invalidation is reactive by design. Silent degradation is silent precisely because most instances never generate feedback.

Mental Model: The Validity Gap

Here's the reframe.

Every memory system has two timelines running simultaneously:

MEMORY TIMELINE → when facts were written, updated, superseded

WORLD TIMELINE → when the underlying reality those facts describe actually changed

Staleness is the gap between these two timelines.

When a user gets promoted, the world timeline updates immediately. The memory timeline updates only if the agent learns about the promotion in a subsequent conversation and the extraction pipeline correctly identifies and stores it. If the user never mentions it, or mentions it in passing and the extraction pipeline misses it, the memory timeline stays behind. The gap widens.

The critical insight: you cannot close this gap from inside the memory system alone. The memory system only knows what it's been told. It has no independent access to ground truth. It cannot poll the world to check whether its memories are still valid.

This means staleness is not primarily an engineering problem — it's an information problem. The memory system can only be as current as the information flow that feeds it. Engineering solutions — TTLs, decay functions, re-extraction — are approximations that reduce the gap. They don't close it.

What good memory architecture accepts about this reality:

Model the gap explicitly. Every memory should carry not just a creation timestamp but a last_confirmed_at field — the last time this fact was either restated or implicitly confirmed by the user. A memory that hasn't been confirmed in ninety days is a candidate for staleness review, not automatic deletion.

Separate validity from relevance at retrieval time. Retrieval should score two things independently: how relevant is this memory to the current query, and how confident are we that it's still valid. These should be separate signals that combine into a final retrieval score — not collapsed into a single relevance score that ignores validity entirely.

Make the gap visible. When an agent retrieves a memory that hasn't been confirmed in a long time, that uncertainty should surface — either in the agent's response ("based on what you told me six months ago...") or in the system's internal confidence score. Silent confidence in stale facts is the failure mode. Explicit uncertainty is the correct behavior.

Production Implications

SaaS customer success agents. Product catalogs change. Pricing changes. Feature sets change. An agent that learned about your product offering in January and has no mechanism to invalidate that knowledge will confidently describe features that no longer exist and miss features that were launched since. The agent becomes a liability as the product evolves — not because its memory is wrong, but because nobody told it the world changed.

Personal assistant agents over long time horizons. Users' lives change. Jobs, relationships, health situations, financial circumstances — all of these evolve over months and years. An agent operating over a multi-year horizon accumulates a growing set of memories, an increasing fraction of which describe a reality that no longer exists. Without validity modeling, the agent becomes less accurate the longer it runs — the exact opposite of the intended behavior.

Regulated environments with audit requirements. A compliance agent in financial services needs to act on current regulatory state, not regulatory state from six months ago. When regulators ask "what information was your agent acting on when it gave this advice?" the answer needs to be accurate and current. A stale memory that informed a compliance decision is not a minor UX issue — it's a liability.

Open Problems and Tradeoffs

World-change detection is unsolved at scale. Detecting that the world changed requires either the user to explicitly report the change, or an external data source to signal it, or an inference from conversation that something is no longer true. The first is unreliable. The second requires integrations that most memory systems don't have. The third is an LLM inference call that can be wrong and adds pipeline complexity.

TTLs are blunt but they're something. Time-based expiry is wrong in theory — validity isn't a function of time — but it's better than nothing in practice. A memory that's never been confirmed in two years probably deserves scrutiny even if it's technically still valid. The tradeoff is false positives: valid memories that happen to be stable and unmentioned get flagged alongside genuinely stale ones. There's no clean solution, only tuning.

Confirmation signals are sparse. The last_confirmed_at model requires the system to recognize implicit confirmations — a user acting consistently with a stored preference, referencing a stored fact in passing. Detecting these signals reliably requires LLM inference on every conversation, which adds latency and cost. The richer the validity model, the more expensive the pipeline.

Practical Recommendations

Add last_confirmed_at to every memory. Update it when the fact is restated, implicitly confirmed, or acted on by the user. Query it at retrieval time. A memory with a last_confirmed_at of ninety-plus days is a staleness candidate — flag it, don't delete it.

Separate relevance and validity in your retrieval scoring. Retrieval score should be a function of both semantic relevance and recency of confirmation — not just semantic relevance. A memory that's highly relevant but hasn't been confirmed in six months should score differently than one confirmed last week.

Run a staleness audit on a real user after thirty days. Pull all active memories. Read them as if you're the agent. Ask how many are still true. The answer will tell you more about your degradation rate than any metric will — and it will tell you quickly.

Surface uncertainty in agent responses for old memories. When retrieving a memory older than a configurable threshold, instruct the agent to hedge: "Based on what you mentioned previously..." rather than stating it as current fact. This converts silent confidence into explicit uncertainty, which users can correct.

Treat external events as invalidation signals. If your agent operates in a domain with external data sources — product catalogs, org charts, regulatory databases — pipe updates from those sources through the memory system as invalidation events. A product deprecation in your catalog should mark related memories as requiring review, not wait for a user complaint.

How Memgram Approaches This

Memgram stores a superseded_at timestamp on every fact that gets replaced — so the write-side timeline is fully auditable. The PipelineTrace captures what the extraction layer saw and decided in each run, which means when a world-change event arrives in a conversation and the system misses it, the trace shows exactly what was considered and why the invalidation wasn't triggered.

The validity gap between memory timeline and world timeline isn't closed by logging. But making the write-side decisions visible is the precondition for doing anything about it. You said: prompt for image for chatgpt

Every Memory System Works in Week One

Real-World Failure

Why It Happens Technically

Why Common Approaches Fail

Mental Model: The Validity Gap

Production Implications

Open Problems and Tradeoffs

Practical Recommendations

How Memgram Approaches This

Comments

More from this blog

Storage Isn't the Hard Part Anymore. Deciding What Stays Relevant Is.

You Can Trace Every LLM Call. You Have No Idea What Your Agent Remembers.

Long Context Windows Are Not a Memory System

Your Agent Remembers Everything. It Has No Idea What's Still True.

Command Palette

Real-World Failure

Why It Happens Technically

Why Common Approaches Fail

Mental Model: The Validity Gap

Production Implications

Open Problems and Tradeoffs

Practical Recommendations

How Memgram Approaches This

Comments

More from this blog