Engineering Notes · LLM Memory Systems
Memory in LLMs
Short-term vs long-term, architectures, patterns, tradeoffs, and production hacks.
Mental model
User input (prompt / message)
Context window (short-term memory)
External stores (long-term memory)
LLM output (response)
Four memory types
IN-CONTEXT: Short-term
Everything inside the current context window. No retrieval step and no setup. Wiped when the session ends. Bounded by the token limit (4K–2M tokens depending on model).
RETRIEVAL: Long-term (RAG)
Vector DBs, BM25, semantic search. Relevant chunks are retrieved at query time and injected into context. Scales to millions of docs but adds latency and complexity.
PARAMETRIC: In-weights
Knowledge baked into model weights via pre-training or fine-tuning. Cannot be updated at inference time. Fast to access but static, so stale knowledge is a hard problem.
STRUCTURED: External DB
SQL/NoSQL/KV stores the model reads and writes via tool calls. User profile DBs, session stores, knowledge graphs. Explicit, queryable, updateable.
Key distinction
Dimension | Short-term | Long-term
Storage | Context window (RAM-like) | Disk/DB (persistent)
Lifetime | One session | Persists across sessions
Capacity | Thousands of tokens | Millions of records
Latency | ~0 ms (already loaded) | 5–500 ms (retrieve)
Update cost | Free (just append) | Indexing, embeddings
Retrieval | Full attention over all tokens | Approximate / semantic
Forgetting | Abrupt (token limit hit) | Gradual (controlled)
What is short-term memory?
The context window, assembled as [system] + [history] + [retrieved_chunks] + [user_turn], is the LLM's working memory. The model attends directly to every token in it, so no retrieval step is needed, although recall can degrade in very long contexts. The main constraints are token count and cost.
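The assembly order can be sketched as follows. This is a minimal illustration assuming a chat-style list of role/content messages. `build_messages` and the `[Retrieved context]` label are hypothetical names, not a specific provider's API.

```python
from typing import Dict, List

Message = Dict[str, str]

def build_messages(system: str, history: List[Message],
                   retrieved_chunks: List[str], user_turn: str) -> List[Message]:
    """Assemble [system] + [history] + [retrieved_chunks] + [user_turn]."""
    messages: List[Message] = [{"role": "system", "content": system}]
    messages.extend(history)
    if retrieved_chunks:
        # Retrieved chunks are injected fresh each turn, just before the question.
        context = "\n\n".join(retrieved_chunks)
        messages.append({"role": "system", "content": f"[Retrieved context]\n{context}"})
    messages.append({"role": "user", "content": user_turn})
    return messages
```

Every element of the returned list consumes tokens, so the history and retrieved chunks are the parts to trim when the budget gets tight.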
What lives in the context window
📌 System prompt: Persona, rules, tools, current date/time. Usually static per deployment.
💬 Conversation history: All prior turns. Grows with each message. First thing compressed or truncated when the limit approaches.
📄 Retrieved context: Chunks from RAG injected before the user's question. Temporary: injected fresh every turn.
🛠 Tool call results: Outputs from function calls, search results, code execution output.
🖼 Multimodal content: Images, PDFs (expanded to thousands of tokens each).
🔢 Scratchpad / CoT: Chain-of-thought tokens, extended thinking blocks.
Sliding window patterns
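# count_tokens() and summarize() are placeholders for a tokenizer and an LLM call.

# Strategy 1: simple truncation (bad for coherence; also drops the
# system prompt once the history is longer than MAX_TURNS)
messages = messages[-MAX_TURNS:]

# Strategy 2: keep system + last N + summarize middle
def compress_history(messages, max_tokens=6000, keep_recent=6):
    # Nothing to compress if the history fits the budget
    # or is too short to have a "middle".
    if count_tokens(messages) <= max_tokens or len(messages) <= keep_recent + 1:
        return messages
    system = messages[0]
    recent = messages[-keep_recent:]   # always keep last 3 turns (user + assistant pairs)
    middle = messages[1:-keep_recent]
    summary = summarize(middle)        # LLM summarization call
    return [system, {"role": "system", "content": summary}] + recent

# Strategy 3: token-budget sliding window
# The length guard stops the loop once only the system prompt and the latest message remain.
while count_tokens(messages) > MAX_TOKENS and len(messages) > 2:
    messages.pop(1)  # drop oldest non-system message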
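To make the token-budget strategy concrete, here is a self-contained sketch. `count_tokens` is a crude whitespace-based stand-in (an assumption for illustration; a real deployment would use the model's own tokenizer), and `trim_to_budget` is a hypothetical helper name.

```python
from typing import Dict, List

Message = Dict[str, str]

def count_tokens(messages: List[Message]) -> int:
    """Rough proxy: one token per whitespace-separated word (assumption)."""
    return sum(len(m["content"].split()) for m in messages)

def trim_to_budget(messages: List[Message], max_tokens: int) -> List[Message]:
    """Drop the oldest non-system messages until the history fits the budget.

    Always keeps messages[0] (the system prompt) and the latest message.
    """
    trimmed = list(messages)  # do not mutate the caller's list
    while count_tokens(trimmed) > max_tokens and len(trimmed) > 2:
        trimmed.pop(1)
    return trimmed

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Ada."},
    {"role": "assistant", "content": "Nice to meet you, Ada."},
    {"role": "user", "content": "What is my name?"},
]
short = trim_to_budget(history, max_tokens=10)
```

With this budget, the turn containing the user's name is dropped before the question about it is answered. That is the abrupt forgetting that summarizing the middle of the history is meant to soften.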
Context window sizes (2025)
Model | Context | Notes
GPT-4o | 128K tokens | ~96K usable in practice
Claude 3.7 / Sonnet 4 | 200K tokens | ~150K economic sweet spot
Gemini 1.5 Pro | 1M tokens | Degraded recall past 200K
Gemini 2.0 Ultra | 2M tokens | Best long-ctx recall so far
Llama 3.3 70B | 128K tokens | Open weights, self-hosted
Long-term memory architecture
Anything that persists beyond a single context window. Implemented as an external system the LLM reads from (retrieval) or writes to (tool call). The model itself never "remembers" — it reads a summary/chunk injected into context each turn.
Storage backends
VECTOR DB: Semantic retrieval
Embed text → store vectors → ANN search at query time. Best for fuzzy, semantic lookup.
Examples: Pinecone, Qdrant, Weaviate, pgvector, ChromaDB
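A toy illustration of the embed, store, search flow. It uses exact cosine similarity over an in-memory list in place of a real ANN index. The vectors are hand-written rather than produced by an embedding model, and `TinyVectorStore` is a hypothetical class, not the API of any product listed above.

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    def __init__(self) -> None:
        self._items: List[Tuple[List[float], str]] = []

    def add(self, vector: List[float], text: str) -> None:
        self._items.append((vector, text))

    def search(self, query: List[float], top_k: int = 3) -> List[Tuple[float, str]]:
        # Exact scan over all items; an ANN index trades exactness for speed.
        scored = [(cosine(query, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

store = TinyVectorStore()
store.add([1.0, 0.0, 0.0], "user likes hiking")
store.add([0.0, 1.0, 0.0], "user works in finance")
store.add([0.9, 0.1, 0.0], "user owns trail shoes")
hits = store.search([1.0, 0.0, 0.0], top_k=2)
```

The two nearest items are the hiking-related ones, even though they share no exact key with the query. That is the fuzzy lookup a KV store cannot do.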
KV STORE: Exact lookup
Redis / DynamoDB for fast exact-key retrieval. User profile, session state, preferences. O(1) lookup, no semantic search.
Examples: Redis, DynamoDB, MongoDB
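A minimal sketch of exact-key memory, using a plain dict as a stand-in for a real KV store. The `user:{id}:profile` key scheme and the `ProfileStore` class are assumptions for illustration only.

```python
import json
from typing import Any, Dict

class ProfileStore:
    """In-memory stand-in for a KV store; values are serialized to JSON strings."""

    def __init__(self) -> None:
        self._kv: Dict[str, str] = {}

    @staticmethod
    def _key(user_id: str) -> str:
        return f"user:{user_id}:profile"

    def get(self, user_id: str) -> Dict[str, Any]:
        raw = self._kv.get(self._key(user_id))
        return json.loads(raw) if raw is not None else {}

    def update(self, user_id: str, **fields: Any) -> None:
        # Read-modify-write: merge new fields into the stored profile.
        profile = self.get(user_id)
        profile.update(fields)
        self._kv[self._key(user_id)] = json.dumps(profile)

store = ProfileStore()
store.update("u42", name="Ada", timezone="UTC")
store.update("u42", timezone="Europe/London")
profile = store.get("u42")
```

Lookups need the exact key, so this pattern suits facts that are always addressed by user or session ID, not facts that must be found by meaning.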
GRAPH DB: Relational memory
Entities + relationships. Best for knowledge graphs, multi-hop reasoning ("Alice works with Bob who knows Carol").
Examples: Neo4j, Memgraph, Graphiti
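The multi-hop example can be sketched with an adjacency list and breadth-first search. This is an in-memory illustration of the idea, not a graph database query. The relation names `works_with` and `knows` and the helper functions are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# (subject, relation, object) triples mirroring the example in the text.
TRIPLES = [
    ("Alice", "works_with", "Bob"),
    ("Bob", "knows", "Carol"),
]

def build_graph(triples) -> Dict[str, List[Tuple[str, str]]]:
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
    return graph

def find_path(graph, start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search returning the shortest chain of hops, or None."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [f"-{rel}->", nxt]))
    return None

graph = build_graph(TRIPLES)
path = find_path(graph, "Alice", "Carol")
```

The resulting path connects Alice to Carol through Bob. A similarity search over isolated text chunks would not surface that link unless a single chunk happened to mention all three.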
RELATIONAL: Structured data
SQL for tabular user data. LLM generates SQL via Text2SQL or retrieves via structured filters.
Examples: PostgreSQL, SQLite, Supabase
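A small sketch of the structured-filter route, using Python's built-in `sqlite3` module with an in-memory database. The `user_facts` table and the `remember` / `recall` helpers are assumptions for illustration. The model would supply only the parameter values, never raw SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_facts (user_id TEXT, key TEXT, value TEXT, "
    "PRIMARY KEY (user_id, key))"
)

def remember(user_id: str, key: str, value: str) -> None:
    # INSERT OR REPLACE keeps one row per (user_id, key), so updates overwrite.
    conn.execute(
        "INSERT OR REPLACE INTO user_facts (user_id, key, value) VALUES (?, ?, ?)",
        (user_id, key, value),
    )

def recall(user_id: str) -> dict:
    # Parameterized query: values are bound, not interpolated into the SQL string.
    rows = conn.execute(
        "SELECT key, value FROM user_facts WHERE user_id = ? ORDER BY key",
        (user_id,),
    ).fetchall()
    return dict(rows)

remember("u42", "plan", "free")
remember("u42", "plan", "pro")
remember("u42", "language", "en")
facts = recall("u42")
```

Exposing fixed, parameterized helpers like these as tools is more constrained than Text2SQL. It gives up flexibility in exchange for predictable queries.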
Memory write pipeline
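# After each conversation turn: extract + store memories
# llm, embedder, vector_db and deduplicate() are placeholders for your own components.
from datetime import datetime, timezone

async def persist_memory(user_id, turn_text, llm):
    # 1. Extract memorable facts from the turn
    facts = await llm.extract(
        prompt=f"Extract key facts from: {turn_text}",
        schema="List[{fact: str, importance: 1-5}]",
    )
    if not facts:
        return

    # 2. Embed the fact texts (needed for both dedup and storage)
    embeddings = await embedder.embed_batch([f["fact"] for f in facts])

    # 3. Deduplicate against this user's existing memories
    novel = []
    for fact, emb in zip(facts, embeddings):
        existing = await vector_db.search(
            vector=emb, filter={"user_id": user_id}, top_k=5
        )
        if deduplicate([fact], existing):   # non-empty result means the fact is new
            novel.append((fact, emb))

    # 4. Upsert with per-fact metadata
    ts = datetime.now(timezone.utc).isoformat()
    await vector_db.upsert(
        vectors=[emb for _, emb in novel],
        metadata=[
            {"user_id": user_id, "text": f["fact"],
             "importance": f["importance"], "timestamp": ts}
            for f, _ in novel
        ],
    )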
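The deduplication step is the least obvious part of the write path, so here is a self-contained sketch of it. It assumes the facts have already been extracted. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the 0.8 threshold and `MemoryStore` class are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption; stands in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class MemoryStore:
    def __init__(self, dedup_threshold: float = 0.8) -> None:
        self.records: List[Dict] = []
        self.dedup_threshold = dedup_threshold

    def write(self, user_id: str, facts: List[str]) -> List[str]:
        """Store only facts that are not near-duplicates of this user's memories."""
        stored = []
        for fact in facts:
            vec = embed(fact)
            is_dup = any(
                r["user_id"] == user_id
                and cosine(vec, r["vector"]) >= self.dedup_threshold
                for r in self.records
            )
            if not is_dup:
                self.records.append({"user_id": user_id, "text": fact, "vector": vec})
                stored.append(fact)
        return stored

store = MemoryStore()
first = store.write("u1", ["prefers dark mode", "lives in Berlin"])
second = store.write("u1", ["prefers dark mode", "allergic to peanuts"])
third = store.write("u2", ["prefers dark mode"])
```

The repeated fact is skipped for the same user but stored for a different one, which is why the duplicate check has to be scoped by `user_id`.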
Memory read pipeline
# At the start of each turn: retrieve + inject
async def build_context(user_id, user_query):
    query_embedding = await embedder.embed(user_query)
    memories = await vector_db.search(
        vector=query_embedding,
        filter={"user_id": user_id},
        top_k=5,
    )
    memory_block = "\n".join(m.text for m in memories)
    return f"[Relevant memories]\n{memory_block}\n\n[User query]\n{user_query}"
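A runnable, synchronous sketch of the same read path over an in-memory list. Word-set (Jaccard) overlap stands in for embedding similarity here, which is an assumption made to keep the example dependency-free. The sample memories are invented for illustration.

```python
from typing import Dict, List

MEMORIES: List[Dict[str, str]] = [
    {"user_id": "u1", "text": "prefers vegetarian food"},
    {"user_id": "u1", "text": "lives in Berlin"},
    {"user_id": "u2", "text": "prefers spicy food"},
]

def overlap(query: str, text: str) -> float:
    """Jaccard overlap of word sets; a toy stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0

def build_context(user_id: str, user_query: str, top_k: int = 2) -> str:
    # Per-user filter first, so one user's memories never leak into another's prompt.
    candidates = [m for m in MEMORIES if m["user_id"] == user_id]
    ranked = sorted(candidates, key=lambda m: overlap(user_query, m["text"]), reverse=True)
    relevant = [m["text"] for m in ranked[:top_k] if overlap(user_query, m["text"]) > 0]
    memory_block = "\n".join(relevant) if relevant else "(none)"
    return f"[Relevant memories]\n{memory_block}\n\n[User query]\n{user_query}"

prompt = build_context("u1", "suggest some food in Berlin")
```

Dropping zero-score matches avoids padding the prompt with irrelevant memories when the store has nothing useful for the current query.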
Implementation methods

Architecture by use case
Known problems
Production hacks & tricks
Best practices