LLM Memory Systems — Engineering Notes

Mental model

User input (prompt / message) → Context window (short-term memory) ↔ External stores (long-term memory) → LLM output (response)

Four memory types

IN-CONTEXT Short-term

Everything inside the current context window. Zero-latency, zero-setup. Wiped when session ends. Bounded by token limit (4K–2M tokens depending on model).

RETRIEVAL Long-term RAG

Vector DBs, BM25, semantic search. Relevant chunks retrieved at query time and injected into context. Scales to millions of docs but adds latency + complexity.
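As a rough illustration of the lexical (BM25) side of retrieval, the sketch below scores a few documents against a query in plain Python. The sample documents, the whitespace tokenizer, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration; a real system would normally rely on a search engine or library rather than hand-rolled scoring.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real systems also stem, strip punctuation, etc.)
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical memory snippets
docs = [
    "the user prefers dark mode in the editor",
    "billing address was updated last week",
    "user asked how to enable dark mode",
]
scores = bm25_scores("dark mode", docs)
# Document indices ordered from most to least relevant
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Only the top-ranked chunks would be injected into the context window. The unrelated billing document scores zero because it shares no terms with the query.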

PARAMETRIC In-weights

Knowledge baked into model weights via pre-training or fine-tuning. Cannot be updated at inference. Fast retrieval, but static — stale knowledge is a hard problem.

STRUCTURED External DB

SQL/NoSQL/KV stores the model reads/writes via tool calls. User profile DBs, session stores, knowledge graphs. Explicit, queryable, updateable.
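A minimal sketch of structured memory exposed as tools, using Python's built-in sqlite3 module. The table name `user_memory` and the function names `remember` and `recall` are hypothetical; how they get registered as tool calls depends on the LLM framework in use and is not shown.

```python
import sqlite3
from typing import Optional

# In-memory database for illustration; a real deployment would use a persistent file or server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_memory ("
    "user_id TEXT, key TEXT, value TEXT, PRIMARY KEY (user_id, key))"
)

def remember(user_id: str, key: str, value: str) -> str:
    """Tool the model could call to write (or overwrite) one fact about a user."""
    conn.execute(
        "INSERT OR REPLACE INTO user_memory (user_id, key, value) VALUES (?, ?, ?)",
        (user_id, key, value),
    )
    conn.commit()
    return "ok"

def recall(user_id: str, key: str) -> Optional[str]:
    """Tool the model could call to read one fact back; returns None if unknown."""
    row = conn.execute(
        "SELECT value FROM user_memory WHERE user_id = ? AND key = ?",
        (user_id, key),
    ).fetchone()
    return row[0] if row else None

remember("u1", "timezone", "UTC")
remember("u1", "timezone", "Europe/Berlin")  # an update overwrites the old value
```

Unlike retrieval from a vector index, the lookup is exact and the update is explicit, which is what makes this memory type queryable and updateable.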

Key distinction

Dimension	Short-term	Long-term
Storage	Context window (RAM-like)	Disk/DB (persistent)
Lifetime	One session	Persists across sessions
Capacity	Thousands to millions of tokens (model-dependent)	Millions of records
Latency	~0ms (already loaded)	5ms–500ms (retrieve)
Update cost	Free (just append)	Indexing, embeddings
Retrieval	Full attention over all tokens	Approximate / semantic
Forgetting	Abrupt (token limit hit)	Gradual (controlled)

What is short-term memory?

The context window, [system] + [history] + [retrieved_chunks] + [user_turn], is the LLM's working memory. The model can attend directly to every token in it, with no retrieval step, although recall can degrade in very long contexts. The main bottlenecks are token count and cost.
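The composition above can be sketched as a small budgeted assembly step. This is an illustration under stated assumptions: `estimate_tokens` uses a crude 4-characters-per-token heuristic instead of a real tokenizer, and the priority order (system and user turn always kept, then retrieved chunks, then the newest history) is one reasonable choice, not a prescribed one.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def build_prompt(system, history, retrieved_chunks, user_turn, max_tokens=8000):
    """Assemble [system] + [history] + [retrieved_chunks] + [user_turn] within a budget.

    The system prompt and user turn are always kept. Retrieved chunks are added
    in the given order, then history turns newest-first, until the budget is spent.
    """
    budget = max_tokens - estimate_tokens(system) - estimate_tokens(user_turn)
    kept_chunks = []
    for chunk in retrieved_chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break
        kept_chunks.append(chunk)
        budget -= cost
    kept_history = []
    for turn in reversed(history):
        cost = estimate_tokens(turn)
        if cost > budget:
            break
        kept_history.append(turn)
        budget -= cost
    kept_history.reverse()  # restore chronological order
    return [system, *kept_history, *kept_chunks, user_turn]
```

With a tight budget, the oldest history turns are the first content to be dropped, which matches the usual behaviour of conversation history under token pressure.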

What lives in the context window

📌 System prompt: Persona, rules, tools, current date/time. Usually static per deployment.

💬 Conversation history: All prior turns. Grows each message. First thing compressed/truncated when limit approaches.

📄 Retrieved context: Chunks from RAG injected before the user's question. Temporary; injected fresh every turn.

🛠 Tool call results: Outputs from function calls, search results, code execution output.

🖼 Multimodal content: Images, PDFs (expanded to thousands of tokens each).

🔢 Scratchpad / CoT: Chain-of-thought tokens, extended thinking blocks.

Sliding window patterns

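# Strategy 1: simple truncation (bad for coherence; also drops the system
# prompt once the history is longer than MAX_TURNS)
messages = messages[-MAX_TURNS:]

# Strategy 2: keep system + last N + summarize middle
def compress_history(messages, max_tokens=6000):
    # Nothing to do if we are under budget or there is no middle to summarize
    if count_tokens(messages) <= max_tokens or len(messages) <= 7:
        return messages
    system = messages[0]
    recent = messages[-6:]       # always keep last 3 turns (user + assistant pairs)
    middle = messages[1:-6]
    summary = summarize(middle)  # LLM summarization call
    return [system, {"role": "system", "content": summary}] + recent

# Strategy 3: token-budget sliding window
# The length guard stops the loop from deleting the final message
while count_tokens(messages) > MAX_TOKENS and len(messages) > 2:
    messages.pop(1)  # drop oldest non-system message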
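The summarize-the-middle strategy above depends on `count_tokens` and `summarize`, which are not defined in these notes. The sketch below fills them with stand-ins so the pattern can be run end to end: a whitespace word count replaces a real tokenizer, and a truncating placeholder replaces the LLM summarization call. Both are assumptions for illustration only.

```python
def count_tokens(messages):
    # Assumption: whitespace word count stands in for a real tokenizer.
    return sum(len(m["content"].split()) for m in messages)

def summarize(messages):
    # Placeholder for an LLM summarization call: keeps the first 5 words of each message.
    parts = [" ".join(m["content"].split()[:5]) for m in messages]
    return "Summary of earlier conversation: " + " | ".join(parts)

def compress_history(messages, max_tokens=50, keep_recent=6):
    """Keep the system prompt and the last keep_recent messages; summarize the rest."""
    if count_tokens(messages) <= max_tokens or len(messages) <= keep_recent + 1:
        return messages
    system = messages[0]
    middle = messages[1:-keep_recent]
    recent = messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(middle)}
    return [system, summary] + recent

# Build a synthetic 10-message conversation after the system prompt
history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(10):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message number {i} " + "filler " * 10})

compressed = compress_history(history)
```

The recent messages survive verbatim while the four middle messages collapse into a single summary message, so the token count drops without losing the latest turns.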

Context window sizes (2025)

Model	Context	Notes
GPT-4o	128K tokens	~96K usable in practice
Claude 3.7 / Sonnet 4	200K tokens	~150K economic sweet spot
Gemini 1.5 Pro	1M tokens	Degraded recall past 200K
Gemini 2.0 Pro (experimental)	2M tokens	Best long-ctx recall so far
Llama 3.3 70B	128K tokens	Open weights, self-hosted

Long-term memory architecture

Anything that persists beyond a single context window. Implemented as an external system the LLM reads from (retrieval) or writes to (tool call). The model itself never "remembers" — it reads a summary/chunk injected into context each turn.

Storage backends

VECTOR DB Semantic retrieval

Embed text → store vectors → ANN search at query time. Best for fuzzy, semantic lookup.

Examples: Pinecone, Qdrant, Weaviate, pgvector, ChromaDB
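To make the embed, store, search flow concrete, here is a toy in-memory store using exact cosine similarity. It is a sketch under assumptions: `TinyVectorStore` is a hypothetical class, the 3-dimensional vectors are hand-written stand-ins for real embeddings, and the search is brute force, whereas the products listed above use approximate nearest-neighbour (ANN) indexes to scale.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Exact (brute-force) nearest-neighbour search over stored vectors."""

    def __init__(self):
        self.items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self.items.append((vector, text))

    def search(self, query_vector, top_k=3):
        # Score every stored vector, then return the top_k highest-scoring texts
        scored = [(cosine(query_vector, v), text) for v, text in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

store = TinyVectorStore()
store.add([1.0, 0.0, 0.0], "likes hiking")
store.add([0.0, 1.0, 0.0], "works in finance")
store.add([0.9, 0.1, 0.0], "enjoys trail running")
results = store.search([1.0, 0.0, 0.0], top_k=2)
```

The two outdoor-related entries come back first because their vectors point in nearly the same direction as the query, which is the sense in which vector lookup is fuzzy rather than exact.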

KV STORE Exact lookup

Redis / DynamoDB for fast exact-key retrieval. User profile, session state, preferences. O(1) lookup, no semantic search.

Examples: Redis, DynamoDB, MongoDB

GRAPH DB Relational memory

Entities + relationships. Best for knowledge graphs, multi-hop reasoning ("Alice works with Bob who knows Carol").

Examples: Neo4j, Memgraph, Graphiti
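Multi-hop lookup over entities and relationships can be illustrated with a plain adjacency map and breadth-first search. The triples and the `find_path` helper below are hypothetical; a graph database would express the same traversal in its own query language.

```python
from collections import deque

# Hypothetical (subject, relation, object) triples extracted from conversations
triples = [
    ("Alice", "works_with", "Bob"),
    ("Bob", "knows", "Carol"),
    ("Carol", "manages", "Dave"),
]

# Directed adjacency map: subject -> list of (relation, object)
graph = {}
for subject, relation, obj in triples:
    graph.setdefault(subject, []).append((relation, obj))

def find_path(start, goal, max_hops=3):
    """Breadth-first search returning the shortest chain of triples, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None

path = find_path("Alice", "Carol")
```

The returned chain of triples can be rendered as text and injected into the context, giving the model the intermediate hop (Bob) that a single similarity search might miss.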

RELATIONAL Structured data

SQL for tabular user data. LLM generates SQL via Text2SQL or retrieves via structured filters.

Examples: PostgreSQL, SQLite, Supabase

Memory write pipeline

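# After each conversation turn: extract + store memories
# (vector_db, embedder, and deduplicate are application-provided placeholders)
from datetime import datetime, timezone

async def persist_memory(user_id, turn_text, llm):
    # 1. Extract memorable facts from the turn
    facts = await llm.extract(
        prompt=f"Extract key facts from: {turn_text}",
        schema="List[{fact: str, importance: 1-5}]",
    )
    texts = [f["fact"] for f in facts]

    # 2. Deduplicate against this user's existing memories (one search per fact)
    novel_texts = []
    for text in texts:
        existing = await vector_db.search(
            query=text, filter={"user_id": user_id}, top_k=5
        )
        novel_texts.extend(deduplicate([text], existing))
    if not novel_texts:
        return

    # 3. Embed + upsert, with one metadata record per vector
    embeddings = await embedder.embed_batch(novel_texts)
    timestamp = datetime.now(timezone.utc).isoformat()
    await vector_db.upsert(
        vectors=embeddings,
        metadata=[
            {"user_id": user_id, "text": t, "timestamp": timestamp}
            for t in novel_texts
        ],
    )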
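The write pipeline above calls a `deduplicate` helper that these notes do not define. One possible sketch, assuming facts are plain strings and that token-overlap (Jaccard) similarity with a 0.8 threshold is an acceptable duplicate test; production systems often compare embeddings instead, and the threshold here is illustrative rather than tuned.

```python
def _tokens(text):
    # Lowercase and strip basic punctuation so near-identical phrasings compare equal.
    return {w.strip(".,!?;:") for w in text.lower().split()} - {""}

def jaccard(a, b):
    """Token-set overlap between two strings, from 0.0 (disjoint) to 1.0 (identical)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def deduplicate(new_facts, existing_facts, threshold=0.8):
    """Return the new facts that are not near-duplicates of stored or earlier new facts."""
    kept = []
    for fact in new_facts:
        seen = list(existing_facts) + kept
        if all(jaccard(fact, other) < threshold for other in seen):
            kept.append(fact)
    return kept

# Hypothetical stored memory and newly extracted candidates
existing = ["User lives in Berlin."]
candidates = [
    "user lives in Berlin",
    "User has a dog named Rex.",
    "User has a dog named Rex",
]
novel = deduplicate(candidates, existing)
```

Checking each candidate against both stored facts and facts already kept in the same batch prevents a single turn from inserting the same memory twice.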

Memory read pipeline

# At the start of each turn: retrieve + inject
async def build_context(user_id, user_query):
    query_embedding = await embedder.embed(user_query)
    memories = await vector_db.search(
        vector=query_embedding,
        filter={"user_id": user_id},
        top_k=5,
    )
    memory_block = "\n".join([m.text for m in memories])
    return f"[Relevant memories]\n{memory_block}\n\n[User query]\n{user_query}"

Implementation methods

Architecture by use case

Known problems

Production hacks & tricks

Best practices