Memory is what separates a chatbot from an assistant. An agent without memory is doomed to repeat questions, lose context mid-task, and feel deeply forgettable. This guide breaks down the memory architectures that production AI agents actually use, and how to wire them together.
Why Memory Matters
The LLM's context window is a temporary scratchpad — wipe it and the agent starts from zero. Real applications need:
- Continuity across sessions
- Personalization based on past interactions
- Knowledge that exceeds the context window
- Audit trails for compliance and debugging
The Four Layers of Agent Memory
1. Working Memory (Short-Term)
The active context window. Holds the current conversation, recent tool calls, and the running scratchpad. Fast, ephemeral, expensive per token.
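In code, working memory is usually just the message list you assemble for each call. A minimal sketch (the names here are illustrative, not from any framework):

def build_context(system_prompt: str, scratchpad: str, turns: list[dict],
                  max_turns: int = 20) -> list[dict]:
    # Per-call context: instructions, the running scratchpad, recent turns.
    # Anything older than max_turns has to come back via the layers below.
    return (
        [{"role": "system", "content": system_prompt},
         {"role": "system", "content": f"Scratchpad:\n{scratchpad}"}]
        + turns[-max_turns:]
    )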
2. Episodic Memory
A chronological log of past interactions. Lets the agent recall "what happened last Tuesday." Typically a relational or document database keyed by user and timestamp.
-- One row per message or tool call, keyed by user and time
CREATE TABLE agent_episodes (
    id          UUID PRIMARY KEY,
    user_id     UUID NOT NULL,
    session_id  UUID NOT NULL,
    role        TEXT NOT NULL,   -- 'user', 'assistant', 'tool', ...
    content     TEXT NOT NULL,
    tool_calls  JSONB,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- Most reads are "recent history for this user"
CREATE INDEX ON agent_episodes (user_id, created_at DESC);
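Reading it back is a plain query. A minimal sketch using psycopg 3 against the schema above (recent_episodes is an illustrative helper, not a library call):

import psycopg

def recent_episodes(conn: psycopg.Connection, user_id: str, limit: int = 20):
    # Newest rows first via the index, reversed so they read as a transcript.
    rows = conn.execute(
        """
        SELECT role, content, created_at
        FROM agent_episodes
        WHERE user_id = %s
        ORDER BY created_at DESC
        LIMIT %s
        """,
        (user_id, limit),
    ).fetchall()
    return list(reversed(rows))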
3. Semantic Memory
Embeddings of facts, documents, and past insights stored in a vector database. The agent retrieves relevant chunks at runtime via similarity search.
import os
from uuid import uuid4

from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("agent-memory")

def remember(user_id: str, fact: str):
    # Embed the fact and store it, tagged with the owning user.
    emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=fact,
    ).data[0].embedding
    index.upsert([(uuid4().hex, emb, {"user_id": user_id, "text": fact})])

def recall(user_id: str, query: str, k: int = 5):
    # Embed the query, then similarity-search within this user's memories only.
    emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding
    return index.query(vector=emb, top_k=k,
                       filter={"user_id": user_id}, include_metadata=True)
4. Procedural Memory
Learned skills and tool-use patterns. In practice this looks like prompt templates, few-shot examples, and successful trace replays. Some teams compile this into LoRA adapters or fine-tuned weights.
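One lightweight approach reuses the semantic-memory index: store traces that worked, then inject the closest matches as few-shot examples for similar tasks. A hedged sketch built on the remember/recall helpers above (the TASK/TRACE format is an assumption, not a standard):

def save_skill(user_id: str, task: str, trace: str):
    # Store a successful trace so it can be replayed as an example later.
    remember(user_id, f"TASK: {task}\nTRACE: {trace}")

def few_shot_examples(user_id: str, task: str, k: int = 3) -> str:
    # Pull the k most similar past traces and format them for the prompt.
    results = recall(user_id, task, k)
    return "\n\n".join(m.metadata["text"] for m in results.matches)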
Memory Compression
Raw history grows fast. Compression keeps the context affordable:
- Summarization: roll older turns into a one-paragraph summary
- Sliding window: keep the last N turns verbatim, summarize the rest (sketched after this list)
- Hierarchical summaries: summaries of summaries for very long sessions
- Selective retrieval: only inject memory relevant to the current query
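A minimal sliding-window-plus-summary sketch; summarize is an assumed helper that asks the LLM for a one-paragraph summary:

def compress(turns: list[dict], keep: int = 10) -> list[dict]:
    # Keep the last `keep` turns verbatim; roll older turns into one summary.
    if len(turns) <= keep:
        return turns
    older, recent = turns[:-keep], turns[-keep:]
    summary = summarize(older)  # assumed LLM call: "summarize this conversation"
    return [{"role": "system",
             "content": f"Summary of earlier conversation:\n{summary}"}] + recent

Run it on every turn and the prompt stays bounded: at most `keep` verbatim turns plus one summary block.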
Choosing a Vector Store
Popular options for semantic memory:
- Pinecone: managed, scales effortlessly, hybrid search
- Weaviate: open-source, GraphQL API, built-in modules
- pgvector: Postgres extension, perfect when you already use Postgres (see the sketch after this list)
- Qdrant: fast, Rust-based, generous free tier
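If pgvector fits, semantic recall is one SQL query. A minimal sketch assuming a memories(user_id, embedding vector(1536), text) table; <=> is pgvector's cosine-distance operator:

def recall_pgvector(conn, user_id: str, query_emb: list[float], k: int = 5):
    # str(query_emb) renders the list as '[...]', which pgvector parses as a vector.
    return conn.execute(
        """
        SELECT text
        FROM memories
        WHERE user_id = %s
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (user_id, str(query_emb), k),
    ).fetchall()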
Privacy and Forgetting
Memory creates compliance obligations. Build forgetting in from day one:
- Per-user isolation in every query
- Hard-delete endpoints that purge episodic and vector storage (sketched after this list)
- TTLs on sensitive categories (PII, payment info)
- Encryption at rest and in transit
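A hard-delete sketch covering both stores. It assumes the psycopg connection and Pinecone index from earlier; note that delete-by-metadata-filter is not available on every vector store or plan (Pinecone serverless indexes, for instance, require listing IDs first):

def forget_user(conn: psycopg.Connection, user_id: str):
    # Purge the episodic log...
    conn.execute("DELETE FROM agent_episodes WHERE user_id = %s", (user_id,))
    conn.commit()
    # ...and every vector tagged with this user.
    index.delete(filter={"user_id": user_id})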
Evaluation
Test memory like any other system. Build scenarios that exercise recall, like the test sketched after this list:
- "Remember my name" → cross-session retrieval
- "What did we agree last week?" → episodic accuracy
- "Tell me about company policy X" → semantic search precision
Conclusion
Memory is the difference between a demo and a product. Layer working, episodic, semantic, and procedural memory — compress aggressively, retrieve selectively, and respect user data. Get it right and your agent stops feeling like a stateless chatbot and starts feeling like a colleague.