Why Context Engineering Replaced Prompt Engineering
In 2023 everyone wanted to be a prompt engineer. In 2026 that job title sounds a little quaint. The frontier models are smart enough that writing clever instructions is rarely the bottleneck. The bottleneck is deciding what information the model sees at each step, where that information comes from, how long it sticks around, and how it gets compressed when things get long.
That discipline is context engineering, and memory is its core problem. An agent that forgets who the user is between turns feels broken. An agent that stuffs every previous message into the prompt gets slow, expensive, and confused. Somewhere between those failure modes is the sweet spot, and finding it is what separates a flashy demo from a production system customers actually rely on.
This guide is the playbook we use at Kanopy Labs when we ship agents. It is opinionated on purpose. If you want a neutral survey, there are plenty of papers. If you want to know what actually works when real users are typing at your product at 2am, keep reading.
The Short-Term Context Window: Stop Treating It Like Infinite
Modern models advertise million token context windows. This has convinced a lot of teams that they can simply append every message, every tool call, and every retrieved document to the prompt forever. That is a trap.
Three things go wrong when you overload the window:
- Attention degrades. The "lost in the middle" effect is real. Information buried in a 400K token prompt gets ignored even when the model technically can see it.
- Latency explodes. Time to first token scales roughly linearly with input size. A 200K token prompt means multi-second waits before the user sees anything, and agent loops amplify that pain.
- Cost compounds. Every tool call in an agent loop replays the entire context. A 50 step agent with a bloated window can burn through hundreds of dollars per session.
The practical rule we follow: treat the context window like RAM, not disk. It is fast, expensive, and volatile. You want only what the model needs for the next decision. Everything else lives somewhere cheaper and gets pulled in on demand. Our guide to agentic workflows goes deeper on how this interacts with tool calling loops.
Sliding Windows and Semantic Compression
The simplest short-term memory strategy is a sliding window. Keep the last N messages verbatim, drop anything older. This works surprisingly well for casual chat where only recent turns matter. It falls apart the moment a user references something from earlier in the conversation.
The upgrade is semantic compression. Instead of dropping old messages, you summarize them. The canonical pattern looks like this:
- Recent turns stay raw. The last 6 to 10 exchanges are kept verbatim so the model has full fidelity for the immediate conversation.
- Older turns get rolled into a running summary. A cheaper model (Haiku, GPT-4o-mini, Llama 3.1 8B) condenses the messages rotating out of the window into bullet points that preserve facts, decisions, and user preferences.
- Critical facts are pinned. User name, current task, constraints, and anything the user explicitly said to remember get stored in a separate system block that never rotates out.
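Here is a minimal sketch of that pattern in plain Python. The message shape and the cheap_summarize helper are placeholders for whatever small model you use for compression, not a prescribed interface:

```python
from dataclasses import dataclass, field

KEEP_RAW = 8  # how many recent messages stay verbatim

@dataclass
class CompressedContext:
    pinned: dict[str, str] = field(default_factory=dict)  # facts that never rotate out
    summary: str = ""                                      # running recap of older turns
    recent: list[dict] = field(default_factory=list)       # last N messages, verbatim

def cheap_summarize(summary: str, evicted: list[dict]) -> str:
    """Placeholder for a call to a small, cheap model that folds the
    evicted messages into the running summary as bullet points."""
    bullets = "\n".join(f"- {m['role']}: {m['content'][:200]}" for m in evicted)
    return f"{summary}\n{bullets}".strip()

def add_message(ctx: CompressedContext, message: dict) -> None:
    ctx.recent.append(message)
    if len(ctx.recent) > KEEP_RAW:
        evicted, ctx.recent = ctx.recent[:-KEEP_RAW], ctx.recent[-KEEP_RAW:]
        ctx.summary = cheap_summarize(ctx.summary, evicted)

def render_prompt(ctx: CompressedContext) -> list[dict]:
    pinned = "\n".join(f"- {k}: {v}" for k, v in ctx.pinned.items())
    system = f"Pinned facts:\n{pinned}\n\nEarlier conversation summary:\n{ctx.summary}"
    return [{"role": "system", "content": system}, *ctx.recent]
```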
LangGraph ships this pattern as a first-class concept through its checkpointer interface. Each node in the graph can read and write to a persistent state store, and you get free resumability if a tool call crashes mid-run. If you are building on LangChain, skip the raw message history abstractions and go straight to LangGraph. The checkpointer model is the right mental shape for agents.
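A minimal wiring sketch, assuming LangGraph's MessagesState and the in-memory MemorySaver checkpointer (you would swap in a persistent checkpointer in production); the chat node body is a stand-in for your own model call:

```python
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.checkpoint.memory import MemorySaver

def chat_node(state: MessagesState) -> dict:
    # Stand-in for your model call; return the new message(s) to append.
    reply = {"role": "assistant", "content": "..."}
    return {"messages": [reply]}

builder = StateGraph(MessagesState)
builder.add_node("chat", chat_node)
builder.add_edge(START, "chat")
builder.add_edge("chat", END)

# The checkpointer persists graph state per thread, which is what gives you
# resumability when a tool call crashes mid-run.
graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "session-123"}}
graph.invoke({"messages": [{"role": "user", "content": "hi"}]}, config)
```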
For teams not on LangGraph, the same pattern works with a plain Postgres table keyed by session id. Do not overengineer this. A jsonb column plus an updated_at index handles a surprising amount of traffic.
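If you go that route, the whole schema fits in a few lines. The table and column names below are just one reasonable layout, not a standard:

```python
import psycopg  # psycopg 3

TABLE = """
CREATE TABLE IF NOT EXISTS agent_sessions (
    session_id text PRIMARY KEY,
    state      jsonb NOT NULL DEFAULT '{}'::jsonb,  -- summary, pinned facts, recent turns
    updated_at timestamptz NOT NULL DEFAULT now()
)
"""
INDEX = "CREATE INDEX IF NOT EXISTS agent_sessions_updated_at_idx ON agent_sessions (updated_at)"

with psycopg.connect("postgresql://localhost/agents") as conn:
    conn.execute(TABLE)
    conn.execute(INDEX)
```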
Long-Term Memory: Vector, Summarization, and Episodic
Short-term memory handles one conversation. Long-term memory is what makes an agent feel like it knows you across sessions, days, and weeks. There are three flavors you should understand, and a production system typically uses all three.
Semantic memory (vector store). Facts and preferences, embedded and indexed. "User prefers metric units." "User is on the enterprise plan." "User asked us to always cc their assistant." You write these as the conversation unfolds and retrieve them by similarity at the start of each new session. This is the workhorse and where most teams start.
Summarization memory. Narrative recaps of past sessions. Rather than retrieving raw snippets, you pull the summary of the last three conversations and prepend it. This is cheap, human readable, and debuggable. The downside is that summaries drift. Facts get rounded off, nuance evaporates, and over many cycles the model starts to believe things the user never actually said.
Episodic memory. Full transcripts or structured event logs of specific interactions, retrieved when relevant. If the user says "like we did last Tuesday," you want to pull up last Tuesday's session verbatim. Episodic memory is heavier to store but irreplaceable for agents that reference their own history.
The mistake we see most often is teams picking one flavor and forcing it to do everything. Vector memory alone cannot capture temporal relationships. Summarization alone loses specificity. Use them together and let each do what it is good at.
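One way to keep all three flavors working together is a single record type with a kind field, so retrieval can mix them deliberately. The schema and store interface below are illustrative assumptions, not any particular library's API:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal, Protocol

@dataclass
class MemoryRecord:
    user_id: str
    kind: Literal["semantic", "summary", "episodic"]
    content: str
    source: str             # "user_stated", "agent_inferred", "session_recap", ...
    created_at: datetime
    scope: str = "default"  # e.g. "work" vs "personal"

class MemoryStore(Protocol):
    def search(self, user_id: str, query: str, kind: str, limit: int) -> list[MemoryRecord]: ...
    def latest(self, user_id: str, kind: str, limit: int) -> list[MemoryRecord]: ...

def build_memory_block(store: MemoryStore, user_id: str, query: str) -> list[MemoryRecord]:
    # Let each flavor do what it is good at: facts by similarity,
    # recaps by recency, full episodes only when the query calls for one.
    facts = store.search(user_id, query, kind="semantic", limit=8)
    recaps = store.latest(user_id, kind="summary", limit=3)
    episodes = store.search(user_id, query, kind="episodic", limit=1)
    return facts + recaps + episodes
```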
The Tooling Landscape: Anthropic, OpenAI, mem0, Letta
You do not have to build this from scratch. The ecosystem in 2026 is crowded but finally mature. Here is how we actually think about the main options.
Anthropic memory tool. Claude now exposes a first-party memory tool that lets the model read and write to a persistent store on its own initiative. The killer feature is that the model decides what is worth remembering, which dramatically reduces the amount of orchestration code you write. The tradeoff is less control. If you need strict auditing of what the agent remembers, you still want a layer on top.
OpenAI Assistants memory. The Assistants API handles thread persistence and has built-in memory features that work well for straightforward chat. It is the fastest path to a working prototype. It is also the hardest to eject from if you later need custom behavior, so weigh lock-in carefully.
mem0. An open source memory layer that sits in front of any LLM. It does extraction, embedding, storage, and retrieval with a clean API. We reach for mem0 when we want portable memory that works across model providers and when the client does not want to be tied to a single vendor's memory primitive. It plays nicely with Postgres, Qdrant, and Pinecone.
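A sketch of the basic mem0 flow, assuming the open source Python SDK's Memory.add and Memory.search interface; check the current mem0 docs for configuration and exact return shapes:

```python
from mem0 import Memory

memory = Memory()  # default config; can be pointed at Postgres, Qdrant, or Pinecone

# Write facts as the conversation unfolds.
memory.add(
    "Prefers metric units and wants their assistant cc'd on all emails",
    user_id="user-42",
)

# At the start of the next session, pull what is relevant to the new request.
hits = memory.search("draft the weekly status email", user_id="user-42")
print(hits)  # exact return shape varies by SDK version
```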
Letta (formerly MemGPT). Takes the OS metaphor seriously. Letta treats memory like paged virtual memory, swapping between a working context and a larger archival store. It is the most sophisticated option and also the heaviest. We use it for agents that need to maintain truly long horizons, like research assistants that run for hours or days.
LangGraph checkpointers. Not a memory product per se, but the state management primitive we recommend for any agent more complex than a single turn chat. Pair it with mem0 or a custom store for long-term memory and you have a clean separation between "what is this agent currently doing" and "what does this agent know about the user."
Our honest default stack for a new client: LangGraph for orchestration and checkpointing, mem0 for user-level long-term memory, Postgres with pgvector for the underlying store, and Claude or GPT-4-class models for the actual reasoning. It is boring, which is the highest compliment we give an architecture.
Retrieval Grounding and the Recency Problem
Memory retrieval is not the same as RAG over your documents, even though the mechanics look similar. RAG pulls facts from a corpus you control. Memory pulls facts about a specific user or agent history. The failure modes are different and the ranking logic has to be different too.
The biggest difference is recency. In document RAG, a chunk written three years ago is often just as valid as one written yesterday. In memory, freshness matters enormously. If a user said "I hate dark mode" in January and "actually turn on dark mode" in March, you absolutely need the March memory to win. Pure cosine similarity will not do that for you.
Three techniques we use to get retrieval grounding right (a combined scoring sketch follows the list):
- Recency boosting. Combine similarity score with an exponential decay on timestamp. Recent memories get a multiplicative bonus that tapers off over weeks.
- Contradiction detection. When writing a new memory, check whether it conflicts with an existing one. If it does, mark the old one as superseded rather than deleting it. This preserves history for debugging and lets you show the user what changed.
- Metadata filtering. Tag memories with type, source, and confidence. At retrieval time, filter aggressively before you even look at similarity. A memory that came from the user directly should beat one the agent inferred.
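To make that concrete, here is a minimal scoring sketch. It assumes each candidate already carries a cosine similarity from the vector store, and the half-life and boost values are placeholders you would tune:

```python
import math
from datetime import datetime, timezone

HALF_LIFE_DAYS = 21.0  # how quickly old memories lose their edge; tune per product
MAX_BOOST = 0.5        # a brand-new memory gets up to a 1.5x multiplier

def recency_boosted_score(similarity: float, created_at: datetime,
                          now: datetime | None = None) -> float:
    now = now or datetime.now(timezone.utc)
    age_days = (now - created_at).total_seconds() / 86400
    decay = math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)
    return similarity * (1.0 + MAX_BOOST * decay)

def rank_memories(candidates: list[dict], scope: str) -> list[dict]:
    # Metadata filters are hard gates, not soft signals: wrong scope or a
    # superseded memory never gets to compete on similarity at all.
    live = [m for m in candidates
            if m["scope"] == scope and not m.get("superseded_by")]
    return sorted(live,
                  key=lambda m: recency_boosted_score(m["similarity"], m["created_at"]),
                  reverse=True)
```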
These are the details that separate a memory system that wows in a demo from one that holds up after a thousand real conversations. We wrote more about retrieval tradeoffs in our fine-tuning versus RAG comparison if you want to go deeper on the underlying mechanics.
Evaluation: How You Know the Memory Is Actually Working
Here is the uncomfortable truth about memory systems: they break silently. A forgotten fact does not throw an exception. A stale summary does not crash the process. Users just get a slightly worse experience and eventually churn. If you are not measuring memory quality, you are flying blind.
The evaluation framework we use has four layers:
- Unit tests on memory operations. For every write and retrieve, assert that the expected content ends up in the store and comes back out. Boring, essential, often skipped.
- Scripted multi-turn scenarios. Canned conversations where the user establishes a fact early, changes topics, then references the fact twenty turns later. Pass or fail based on whether the agent recalls it correctly (an example scenario follows this list).
- LLM as judge on real traces. Sample production conversations, have a stronger model grade whether the agent used memory appropriately, whether it hallucinated something it should have checked, and whether it remembered things the user asked it to forget.
- User signals. Thumbs up and down, explicit "you got that wrong" feedback, and regret events like users re-explaining something the agent should have known. These are the ground truth.
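A scripted scenario can be as simple as a pytest function. The run_turn helper below is a stand-in for however you invoke your agent with a persistent session, and in practice you would often swap the string assertion for an LLM judge:

```python
def run_turn(session_id: str, user_message: str) -> str:
    """Stand-in: send one user turn through the real agent with a persistent
    session and return its reply."""
    raise NotImplementedError

def test_recalls_fact_after_topic_change():
    session = "eval-allergy-recall"
    run_turn(session, "Quick note before we start: I'm allergic to peanuts.")
    # Twenty turns of unrelated chatter so the fact falls out of the recent window.
    for i in range(20):
        run_turn(session, f"Unrelated question #{i}: name a good sci-fi novel.")
    reply = run_turn(session, "Pack me a snack list for tomorrow's flight.")
    # Crude string check for the sketch; production suites usually hand this to an LLM judge.
    assert "allerg" in reply.lower() or "peanut" not in reply.lower()
```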
On top of that, track a small number of operational metrics: memory hit rate at retrieval time, average memories injected per turn, and context window utilization. When any of those drift, something upstream has changed and you want to know before the users do.
One more thing. Build a memory inspector from day one. A simple admin view that shows exactly what the agent remembers about a specific user, with timestamps and sources, pays for itself the first time a customer asks "why does your AI think I live in Denver." We have never regretted building this, and we have regretted not building it more times than we want to admit.
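The inspector does not need to be fancy. A read-only endpoint over the memory table is enough; FastAPI here is just one convenient way to expose it, and the table layout is assumed to match the record sketch above:

```python
from fastapi import FastAPI
import psycopg

app = FastAPI()
DSN = "postgresql://localhost/agents"

@app.get("/admin/memory/{user_id}")
def inspect_memory(user_id: str) -> list[dict]:
    # Read-only view of everything the agent believes about a user,
    # newest first, with where each belief came from.
    query = """
        SELECT kind, content, source, scope, created_at, superseded_by
        FROM memories
        WHERE user_id = %s
        ORDER BY created_at DESC
    """
    with psycopg.connect(DSN) as conn:
        rows = conn.execute(query, (user_id,)).fetchall()
    cols = ["kind", "content", "source", "scope", "created_at", "superseded_by"]
    return [dict(zip(cols, row)) for row in rows]
```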
Privacy, Forgetting, and the Governance Layer
Memory that cannot be deleted is a liability. Once you start storing user facts across sessions, you have inherited a data protection problem. GDPR, CCPA, and every serious enterprise contract now include language about AI memory specifically. If your answer to "delete everything you know about me" is "um," you are not production ready.
Design for forgetting from the start:
- Hard delete path. A single function that removes every memory for a given user id from every store, including vector indexes and summary caches. Test it regularly (a sketch follows this list).
- Soft expiration. Memories older than a configurable threshold get purged automatically unless explicitly pinned. Most users do not want an agent that remembers a throwaway comment from two years ago.
- Scope boundaries. A memory written during a work session should not leak into a personal session. Tag everything with scope and filter at retrieval. This is especially critical for multi-agent systems where different agents may share a memory store.
- User visibility. Let users see what the agent remembers about them and edit or delete individual entries. This is table stakes for enterprise deals in 2026.
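A sketch of the hard delete path, written against a store interface we made up for illustration so we do not misrepresent any vendor's delete API. The point is the fan-out: one entry point, every store:

```python
from typing import Protocol

class ForgettableStore(Protocol):
    name: str
    def delete_user(self, user_id: str) -> int:
        """Remove every record for this user and return how many were deleted."""
        ...

def forget_user(user_id: str, stores: list[ForgettableStore]) -> dict[str, int]:
    # One entry point that fans out to every store: relational rows, vector
    # indexes, summary caches. If a store is missing from this list, the
    # hard delete path is broken.
    return {store.name: store.delete_user(user_id) for store in stores}
```

Each backend, whether that is the Postgres tables, the vector index, or a summary cache, implements the interface, and a scheduled test runs this against a synthetic user and asserts zero leftovers.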
Treat the memory store with the same seriousness as your user database. It is not a cache. It is PII.
Putting It All Together
If you are starting a new agent project today, here is the order of operations we would follow. First, build the agent with a simple sliding window and no long-term memory at all. Ship it, watch real users, find out what they actually try to do across sessions. Most teams discover their memory requirements are different from what they guessed.
Second, add a long-term memory layer backed by mem0 or a minimal Postgres plus pgvector store. Write user facts aggressively, retrieve them with recency-boosted similarity, and pin critical state. Add the memory inspector at the same time so you can see what is happening.
Third, layer in semantic compression for the short-term window. This is where you get real cost and latency wins, and it is also where the most subtle bugs live. Do not skip evaluation here.
Fourth, build your eval harness. Scripted scenarios, LLM judge, user signals. Run it on every deploy. Track memory hit rate as a first class metric.
Finally, harden governance. Delete paths, scope boundaries, user visibility. Do this before you sign your first enterprise contract, not after.
Context engineering is not glamorous work. It is plumbing. But the agents that feel magical in 2026 are the ones where the plumbing is right, and the ones that feel frustrating are almost always the ones where somebody skipped these steps in a hurry to ship.
If you want a team that has done this dance a dozen times and can save you the expensive detours, we would love to help. Book a free strategy call and we will walk through your use case, sketch a memory architecture, and tell you honestly whether you should build, buy, or wait.
Need help building this?
Our team has launched 50+ products for startups and ambitious brands. Let's talk about your project.