Engineering Notes · LLM Memory Systems
Memory in LLMs
Short-term vs long-term, architectures, patterns, tradeoffs, and production hacks.
Overview · Short-term · Long-term · Methods · Use cases · Bottlenecks · Hacks · Best practices
User input (prompt / message) → Context window (short-term memory) ↔ External stores (long-term memory) → LLM output (response)
IN-CONTEXT Short-term
Everything inside the current context window. Zero-latency, zero-setup. Wiped when session ends. Bounded by token limit (4K–2M tokens depending on model).
RETRIEVAL Long-term RAG
Vector DBs, BM25, semantic search. Relevant chunks retrieved at query time and injected into context. Scales to millions of docs but adds latency + complexity.
PARAMETRIC In-weights
Knowledge baked into model weights via pre-training or fine-tuning. Cannot be updated at inference. Fast retrieval, but static — stale knowledge is a hard problem.
STRUCTURED External DB
SQL/NoSQL/KV stores the model reads/writes via tool calls. User profile DBs, session stores, knowledge graphs. Explicit, queryable, updateable.
| Dimension | Short-term | Long-term |
| Storage | Context window (RAM-like) | Disk/DB (persistent) |
| Lifetime | One session | Persists across sessions |
| Capacity | Thousands to ~2M tokens (model-dependent) | Millions of records |
| Latency | ~0ms (already loaded) | 5ms–500ms (retrieve) |
| Update cost | Free (just append) | Indexing, embeddings |
| Retrieval | Full attention over all tokens | Approximate / semantic |
| Forgetting | Abrupt (token limit hit) | Gradual (controlled) |
The context window, [system] + [history] + [retrieved_chunks] + [user_turn], is the LLM's working memory. The model attends to every token in it directly, so no retrieval step is needed. Recall can still degrade in very long contexts, but the main constraints are token count and cost.
📌 System prompt: Persona, rules, tools, current date/time. Usually static per deployment.
💬 Conversation history: All prior turns. Grows each message. First thing compressed/truncated when limit approaches.
📄 Retrieved context: Chunks from RAG injected before the user's question. Temporary: injected fresh every turn.
🛠 Tool call results: Outputs from function calls, search results, code execution output.
🖼 Multimodal content: Images, PDFs (expanded to thousands of tokens each).
🔢 Scratchpad / CoT: Chain-of-thought tokens, extended thinking blocks.
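As a rough sketch of how these components might be packed into one request, the snippet below adds them in priority order under a token budget. The function names, the 4-characters-per-token estimate, and the default budget are illustrative assumptions, not part of any particular SDK.

```python
from typing import Dict, List

def estimate_tokens(text: str) -> int:
    # Assumption: ~4 characters per token; use the model's real tokenizer in practice.
    return max(1, len(text) // 4)

def assemble_context(system: str, history: List[Dict[str, str]],
                     retrieved_chunks: List[str], user_turn: str,
                     budget: int = 8000) -> List[Dict[str, str]]:
    """Build [system] + [history] + [retrieved_chunks] + [user_turn] within a budget."""
    # The system prompt and the current user turn are always kept.
    used = estimate_tokens(system) + estimate_tokens(user_turn)

    # Retrieved chunks are added next, in ranked order, while they fit.
    kept_chunks: List[str] = []
    for chunk in retrieved_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        kept_chunks.append(chunk)
        used += cost

    # History fills the remaining space, newest messages first.
    kept_history: List[Dict[str, str]] = []
    for msg in reversed(history):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept_history.append(msg)
        used += cost
    kept_history.reverse()  # restore chronological order

    messages = [{"role": "system", "content": system}]
    messages += kept_history
    if kept_chunks:
        messages.append({"role": "system",
                         "content": "[Retrieved context]\n" + "\n---\n".join(kept_chunks)})
    messages.append({"role": "user", "content": user_turn})
    return messages
```

Giving retrieved chunks priority over old history is one possible design choice; a deployment that values conversational continuity more might reverse the order.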
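# Strategy 1: simple truncation (bad for coherence)
# Keep the system prompt, then the last MAX_TURNS messages.
messages = messages[:1] + messages[1:][-MAX_TURNS:]

# Strategy 2: keep system + last N + summarize middle
def compress_history(messages, max_tokens=6000):
    # Nothing to compress if we fit the budget or there is no middle section.
    if count_tokens(messages) <= max_tokens or len(messages) <= 7:
        return messages
    system = messages[0]
    recent = messages[-6:]       # always keep last 3 turns (user + assistant each)
    middle = messages[1:-6]
    summary = summarize(middle)  # LLM summarization call
    return [system, {"role": "system", "content": summary}] + recent

# Strategy 3: token-budget sliding window
# Stop once only the system prompt and the newest message remain.
while count_tokens(messages) > MAX_TOKENS and len(messages) > 2:
    messages.pop(1)  # drop oldest non-system message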
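The strategies above rely on a `count_tokens` helper that is not defined here. A minimal self-contained version of the sliding window might look like the following; the character-based token estimate, the per-message overhead, and the function names are assumptions for illustration only.

```python
from typing import Dict, List

def count_tokens(messages: List[Dict[str, str]]) -> int:
    # Assumption: ~4 characters per token plus a small fixed overhead per message.
    return sum(len(m["content"]) // 4 + 4 for m in messages)

def sliding_window(messages: List[Dict[str, str]], max_tokens: int) -> List[Dict[str, str]]:
    """Drop the oldest non-system messages until the list fits the budget."""
    trimmed = list(messages)  # do not mutate the caller's list
    # Keep index 0 (system prompt) and at least the newest message.
    while count_tokens(trimmed) > max_tokens and len(trimmed) > 2:
        trimmed.pop(1)
    return trimmed

history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i} " * 10})
    history.append({"role": "assistant", "content": f"answer {i} " * 10})

window = sliding_window(history, max_tokens=200)
```

Because messages are dropped one at a time, the window can start on an assistant message; dropping in user/assistant pairs would avoid that at the cost of slightly coarser trimming.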
| Model | Context | Notes |
| GPT-4o | 128K tokens | ~96K usable in practice |
| Claude 3.7 / Sonnet 4 | 200K tokens | ~150K economic sweet spot |
| Gemini 1.5 Pro | 1M tokens | Degraded recall past 200K |
| Gemini 2.0 Pro (experimental) | 2M tokens | Best long-ctx recall so far |
| Llama 3.3 70B | 128K tokens | Open weights, self-hosted |
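One way to reason about "usable" context is to subtract fixed overheads from the advertised window. The sketch below does that arithmetic; the overhead figures are made-up placeholders for a hypothetical deployment, not measurements.

```python
def history_budget(context_limit: int, reserved_output: int,
                   system_tokens: int, retrieved_tokens: int) -> int:
    """Tokens left for conversation history after fixed overheads."""
    remaining = context_limit - reserved_output - system_tokens - retrieved_tokens
    return max(0, remaining)

# Hypothetical deployment on a 128K-token model.
budget = history_budget(
    context_limit=128_000,
    reserved_output=4_000,   # room for the model's reply
    system_tokens=1_500,     # persona, rules, tool definitions
    retrieved_tokens=6_000,  # RAG chunks injected each turn
)
print(budget)  # 116500
```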
Anything that persists beyond a single context window. Implemented as an external system the LLM reads from (retrieval) or writes to (tool call). The model itself never "remembers" — it reads a summary/chunk injected into context each turn.
VECTOR DB Semantic retrieval
Embed text → store vectors → ANN search at query time. Best for fuzzy, semantic lookup.
Pinecone, Qdrant, Weaviate, pgvector, ChromaDB
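The embed → store → search flow can be sketched without any external service. The example below swaps in a toy bag-of-words "embedding" and exact brute-force cosine similarity in place of a learned embedding model and an ANN index, so it illustrates the data flow only. `toy_embed` and `InMemoryVectorStore` are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def toy_embed(text: str) -> Dict[str, float]:
    # Stand-in for a real embedding model: L2-normalized word counts.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

class InMemoryVectorStore:
    def __init__(self) -> None:
        self._items: List[Tuple[Dict[str, float], str]] = []

    def add(self, text: str) -> None:
        self._items.append((toy_embed(text), text))

    def search(self, query: str, top_k: int = 3) -> List[Tuple[float, str]]:
        # Exact search: score every stored vector, then keep the best top_k.
        q = toy_embed(query)
        scored = [(cosine(q, vec), text) for vec, text in self._items]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]

store = InMemoryVectorStore()
store.add("user prefers dark mode in the editor")
store.add("user is allergic to peanuts")
store.add("user works on a payments service in Go")

results = store.search("which editor theme does the user like", top_k=1)
```

Here the top hit is the dark-mode memory, because it shares the most words with the query. A real embedding model would also match paraphrases that share no words at all, which is the point of semantic retrieval.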
KV STORE Exact lookup
Redis / DynamoDB for fast exact-key retrieval. User profile, session state, preferences. O(1) lookup, no semantic search.
Redis, DynamoDB, MongoDB
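A plain dictionary is enough to show the access pattern: memories are addressed by an exact key, with no similarity search involved. The `user:<id>:profile` key scheme and the class below are assumptions for illustration; a production system would put Redis or DynamoDB behind the same interface.

```python
import json
from typing import Any, Dict, Optional

class SessionStore:
    """Dict-backed stand-in for a key-value store such as Redis."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set_profile(self, user_id: str, profile: Dict[str, Any]) -> None:
        # Values are serialized to JSON strings, as they often are in a KV store.
        self._data[f"user:{user_id}:profile"] = json.dumps(profile)

    def get_profile(self, user_id: str) -> Optional[Dict[str, Any]]:
        # Exact-key lookup, average O(1) for a hash map.
        raw = self._data.get(f"user:{user_id}:profile")
        return json.loads(raw) if raw is not None else None

store = SessionStore()
store.set_profile("u42", {"name": "Alice", "language": "en", "tone": "concise"})

profile = store.get_profile("u42")
missing = store.get_profile("u99")  # unknown key: no fuzzy fallback is attempted
```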
GRAPH DB Relational memory
Entities + relationships. Best for knowledge graphs, multi-hop reasoning ("Alice works with Bob who knows Carol").
Neo4j, Memgraph, Graphiti
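The multi-hop example ("Alice works with Bob who knows Carol") can be illustrated with a tiny adjacency list and a breadth-first search. This is a sketch of the idea only: a graph database would express the same traversal in its own query language, and the relation names here are made up.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Adjacency list: entity -> list of (relation, entity) edges.
Graph = Dict[str, List[Tuple[str, str]]]

graph: Graph = {
    "Alice": [("works_with", "Bob")],
    "Bob": [("knows", "Carol")],
    "Carol": [],
}

def find_path(graph: Graph, start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search returning the shortest chain of facts linking two entities."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [f"{node} {relation} {neighbor}"]))
    return None  # no connection found

path = find_path(graph, "Alice", "Carol")
# path == ["Alice works_with Bob", "Bob knows Carol"]
```

The returned chain of facts is what would be injected into context, so the model sees the intermediate hop rather than having to infer it.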
RELATIONAL Structured data
SQL for tabular user data. LLM generates SQL via Text2SQL or retrieves via structured filters.
PostgreSQL, SQLite, Supabase
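The structured-filter path can be sketched with Python's built-in `sqlite3` module. The `user_facts` table, its columns, and the sample rows are invented for illustration; the point is that the model (or the application on its behalf) supplies filter values, not raw SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_facts (user_id TEXT, category TEXT, fact TEXT, importance INTEGER)"
)
conn.executemany(
    "INSERT INTO user_facts VALUES (?, ?, ?, ?)",
    [
        ("u42", "preference", "prefers metric units", 3),
        ("u42", "work", "maintains the billing service", 5),
        ("u7", "preference", "prefers dark mode", 2),
    ],
)

def facts_for(user_id: str, min_importance: int = 1) -> list:
    # Parameterized query: values are bound by the driver, never string-formatted in.
    rows = conn.execute(
        "SELECT fact FROM user_facts "
        "WHERE user_id = ? AND importance >= ? "
        "ORDER BY importance DESC",
        (user_id, min_importance),
    ).fetchall()
    return [row[0] for row in rows]

important = facts_for("u42", min_importance=4)
```

Binding parameters this way matters more than usual when an LLM is in the loop, since model-generated values should be treated as untrusted input.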
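# After each conversation turn: extract + store memories.
# llm, embedder, vector_db, and deduplicate are placeholder interfaces.
from datetime import datetime, timezone

async def persist_memory(user_id, turn_text, llm):
    # 1. Extract memorable facts from the turn
    facts = await llm.extract(
        prompt=f"Extract key facts from: {turn_text}",
        schema="List[{fact: str, importance: 1-5}]"
    )
    fact_texts = [f["fact"] for f in facts]
    # 2. Deduplicate against this user's existing memories only
    existing = await vector_db.search(
        query=fact_texts, filter={"user_id": user_id}, top_k=5
    )
    novel_facts = deduplicate(fact_texts, [m.text for m in existing])
    if not novel_facts:
        return  # nothing new to store
    # 3. Embed + upsert, storing the text so it can be injected at retrieval time
    embeddings = await embedder.embed_batch(novel_facts)
    timestamp = datetime.now(timezone.utc).isoformat()
    await vector_db.upsert(
        vectors=embeddings,
        metadata=[{"user_id": user_id, "text": fact, "timestamp": timestamp}
                  for fact in novel_facts]
    )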
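`deduplicate` is left undefined above. One possible shape is sketched below, assuming facts and existing memories are plain strings and that near-duplicates can be caught by token overlap (Jaccard similarity). The 0.8 threshold is an arbitrary starting point, and an embedding-based similarity check would be a natural alternative.

```python
from typing import List, Set

def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())

def jaccard(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0  # two empty strings count as identical
    return len(ta & tb) / len(ta | tb)

def deduplicate(facts: List[str], existing: List[str], threshold: float = 0.8) -> List[str]:
    """Return facts that are not near-duplicates of existing memories or of each other."""
    kept: List[str] = []
    for fact in facts:
        seen = existing + kept
        if all(jaccard(fact, other) < threshold for other in seen):
            kept.append(fact)
    return kept

novel = deduplicate(
    facts=["User lives in Berlin", "user lives in berlin", "User has two cats"],
    existing=["User has two cats"],
)
# novel == ["User lives in Berlin"]
```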
# At the start of each turn: retrieve + inject
async def build_context(user_id, user_query):
    query_embedding = await embedder.embed(user_query)
    memories = await vector_db.search(
        vector=query_embedding,
        filter={"user_id": user_id},
        top_k=5
    )
    memory_block = "\n".join(m.text for m in memories)
    return f"[Relevant memories]\n{memory_block}\n\n[User query]\n{user_query}"
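To try the retrieval path end to end without a real vector database, the stand-ins below fake the `embedder` and `vector_db` interfaces used above. Everything here (the class names, the word-overlap "embedding") is a hypothetical test double, and `build_context` is restated with `top_k=1` so the block runs on its own.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Memory:
    text: str
    user_id: str

class FakeEmbedder:
    async def embed(self, text: str) -> Set[str]:
        # Stand-in "embedding": the set of lowercase words.
        return set(text.lower().split())

class FakeVectorDB:
    def __init__(self, memories: List[Memory]) -> None:
        self._memories = memories

    async def search(self, vector: Set[str], filter: Dict[str, str], top_k: int) -> List[Memory]:
        # Restrict to the requesting user, then rank by word overlap with the query.
        candidates = [m for m in self._memories if m.user_id == filter["user_id"]]
        candidates.sort(key=lambda m: len(vector & set(m.text.lower().split())), reverse=True)
        return candidates[:top_k]

embedder = FakeEmbedder()
vector_db = FakeVectorDB([
    Memory("User is vegetarian", "u1"),
    Memory("User lives in Lisbon", "u1"),
    Memory("User owns a sailboat", "u2"),
])

async def build_context(user_id: str, user_query: str) -> str:
    query_embedding = await embedder.embed(user_query)
    memories = await vector_db.search(
        vector=query_embedding, filter={"user_id": user_id}, top_k=1
    )
    memory_block = "\n".join(m.text for m in memories)
    return f"[Relevant memories]\n{memory_block}\n\n[User query]\n{user_query}"

context = asyncio.run(build_context("u1", "Suggest a vegetarian dinner"))
```

The `user_id` filter is what keeps one user's memories out of another user's prompt, so it is worth covering with a test like this even when the real store is swapped in.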