vLLM · Inference Engineering
KV Cache — Field Notes
Prefix caching · Multi-tenant architecture · VRAM sizing · Production gotchas
What is Automatic Prefix Caching (APC)?
vLLM's KV cache stores computed Key-Value pairs from attention layers so they don't need to be recomputed for tokens already seen. APC extends this: it hashes token sequences (in 16-token blocks via PagedAttention) and reuses KV entries across entirely different requests — as long as they share a common prefix.

You don't "add" KV cache manually. You enable prefix caching and structure your prompts so stable parts always come first.
Key insight: vLLM hashes the raw token sequence. Two requests sharing the same first N tokens get a cache hit on those N tokens, rounded down to whole 16-token blocks. The cache is global: it is shared across all users hitting the same server instance.
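To make the block-level matching concrete, here is a toy model of it. This is a sketch of the idea only, not vLLM's implementation: the function names, the SHA-256 choice and the 40-token "system prompt" are all assumptions made for illustration. Each full block is hashed together with the hash of the block before it, so a block only matches when everything in front of it matches too.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block, as in the 16-token blocks described above


def block_hashes(token_ids: list, block_size: int = BLOCK_SIZE) -> list:
    """Hash every full block, chaining in the hash of the previous block."""
    hashes = []
    parent = ""
    full_blocks = len(token_ids) // block_size  # a trailing partial block is skipped
    for i in range(full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(token_ids: list, cache: set, block_size: int = BLOCK_SIZE) -> int:
    """Count the leading tokens whose blocks are already present in the cache."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break  # the first miss ends the reusable prefix
        hits += 1
    return hits * block_size


system = list(range(40))               # hypothetical 40-token system prompt
request_a = system + [900, 901, 902]   # user A's query tokens
request_b = system + [950, 951]        # user B's query tokens

cache = set(block_hashes(request_a))   # request A warms the cache
print(cached_prefix_tokens(request_b, cache))  # 32
```

Only 32 of the 40 shared tokens are reused here, because the third block also contains query tokens that differ between the two requests.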
Prompt ordering rule — stable content must come first
1. System prompt (cached after first request)
2. Few-shot examples (cached after first request)
3. Fixed query component (cached after first request)
4. User query, changes per request (computed fresh every time)
5. Dynamic data (computed fresh every time)
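The effect of ordering can be seen with plain strings. The sketch below uses character-level common prefixes as a rough stand-in for token-level matching; the helper names and the sample prompt text are made up for illustration.

```python
import os


def build_prompt(system, few_shot, fixed_query, user_query, dynamic_data):
    """Stable sections first, per-request sections last."""
    return "\n".join([system, few_shot, fixed_query, user_query, dynamic_data])


def build_prompt_bad(system, few_shot, fixed_query, user_query, dynamic_data):
    """Anti-pattern: per-request data placed ahead of the stable sections."""
    return "\n".join([dynamic_data, system, few_shot, fixed_query, user_query])


stable = ("You are a QA assistant.", "Q: 1+1? A: 2", "Classify the report.")

good = [build_prompt(*stable, "report 17", "1001 rows"),
        build_prompt(*stable, "report 42", "2002 rows")]
bad = [build_prompt_bad(*stable, "report 17", "1001 rows"),
       build_prompt_bad(*stable, "report 42", "2002 rows")]

# Length of the text both requests have in common, counted from the start
shared_good = len(os.path.commonprefix(good))
shared_bad = len(os.path.commonprefix(bad))
print(shared_good, shared_bad)
```

With the stable sections first, the whole stable part is shared between the two requests. With the dynamic data first, the shared prefix is empty and nothing can be reused.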
KV cache (bytes) = 2 × L × H_kv × D_head × T × dtype_bytes

2: Key + Value, both must be stored
L: number of transformer layers (e.g. 32 for 8B, 80 for 70B)
H_kv: number of KV heads; use the KV-head count if the model uses GQA, NOT the query heads
D_head: head dimension = hidden_size / num_attention_heads (commonly 128)
T: token count, i.e. your system prompt length in tokens
dtype_bytes: 2 for FP16/BF16 · 1 for FP8 · 4 for FP32
Your original formula was almost right. The correct form is 2 × L × H_kv × D_head × T × dtype_bytes. The "2 bytes" was the dtype size for FP16: not wrong, just one factor among several. The critical distinction is using KV heads specifically (8 for Llama 3.1 70B) rather than total attention heads (64). For that model, GQA makes the cache 8× smaller than a naive calculation with query heads; for Llama 3.1 8B (32 query heads, 8 KV heads) the factor is 4×. Always check num_key_value_heads in the model config.
Full pool formula (N system prompts):
Total = 2 × L × H_kv × D_head × T_avg × dtype_bytes × N_prompts

This tells you the VRAM needed to hold all N system prompt prefixes simultaneously — i.e. zero-eviction scenario.
Interactive VRAM calculator
Default inputs: L = 32, H_kv = 8, D_head = 128, T = 512 tokens, N_prompts = 50. Outputs: per token (KB), per system prompt (MB), all prompts in the pool (MB total).
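The per-token and pool formulas can be evaluated in a few lines. The sketch below plugs in Llama 3.1 8B values (32 layers, 8 KV heads, head dimension 128), a 512-token system prompt and a pool of 50 prompts. FP16 is assumed for the dtype, and the function names are illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """2 (K and V) x layers x KV heads x head dim x bytes per element."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_pool_bytes(layers: int, kv_heads: int, head_dim: int,
                  tokens_avg: int, n_prompts: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to keep every system prompt prefix resident at once."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes)
    return per_token * tokens_avg * n_prompts


per_token = kv_bytes_per_token(32, 8, 128)   # FP16 assumed
per_prompt = per_token * 512
pool = kv_pool_bytes(32, 8, 128, tokens_avg=512, n_prompts=50)

print(f"{per_token / 2**10:.0f} KiB per token")        # 128 KiB per token
print(f"{per_prompt / 2**20:.0f} MiB per prompt")      # 64 MiB per prompt
print(f"{pool / 2**30:.3f} GiB for the whole pool")    # 3.125 GiB for the whole pool
```

Passing dtype_bytes=1 models an FP8 KV cache and halves every figure.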
Scenario A — Fixed system prompt + changing text
Ideal for prefix caching. After the first request warms the cache, every subsequent request skips computing the system prompt tokens entirely — only the user query portion requires fresh compute.

Enable with: enable_prefix_caching=True on your LLM instance, or --enable-prefix-caching on the server.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        enable_prefix_caching=True,
        gpu_memory_utilization=0.90,
    )

    # System prompt must be byte-for-byte identical on every call
    SYSTEM = "You are a manufacturing AI..."

    def infer(user_query: str):
        # The role markers here are placeholders; use the model's real chat template
        prompt = f"<|system|>\n{SYSTEM}\n<|user|>\n{user_query}\n<|assistant|>\n"
        return llm.generate([prompt], SamplingParams(max_tokens=512))
Scenario B — Fixed system prompt + fixed visual query + changing images
Partially cacheable. The text prefix (system prompt + fixed query string) is cached normally. The image pixel embeddings are not prefix-cached because they're position-dependent visual tokens recomputed per image.
| Component | Cached? | Reason |
| --- | --- | --- |
| System prompt tokens | Yes | Pure text prefix, hashed and reused |
| Fixed query string ("What's happening…") | Yes | Part of stable text prefix |
| Image visual embeddings | No | Computed fresh per image, position-dependent |
| Fixed reference image (if always at position 0) | Partial | Model-arch dependent; some vLLM versions cache fixed-position image tokens |
Tip: If you have a fixed reference image (e.g. a CAD drawing or defect template), always pass it at position 0 in the image list. Some vision models + vLLM versions will cache its visual tokens on repeated use.
How the global cache behaves with multiple users
vLLM's prefix cache is server-global and user-agnostic. It caches based purely on token hash matches. With 1000 requests and a pool of 50 system prompts, once all 50 are warmed (after the first ~50 unique requests across the pool), every subsequent request gets a cache hit on its system prompt prefix.
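A quick simulation shows the arithmetic behind that claim. It is a sketch under simplifying assumptions: requests pick one of the 50 prompts uniformly at random, nothing is ever evicted, and a "hit" means the system prompt was seen before. The function name is made up.

```python
import random


def simulate_hits(n_requests: int, n_prompts: int, seed: int = 0):
    """Return (hits, misses) for system-prompt prefixes, assuming no eviction."""
    rng = random.Random(seed)
    warmed = set()          # prompt ids whose prefix is already cached
    hits = misses = 0
    for _ in range(n_requests):
        prompt_id = rng.randrange(n_prompts)
        if prompt_id in warmed:
            hits += 1
        else:
            misses += 1     # first use of this prompt pays the full prefill
            warmed.add(prompt_id)
    return hits, misses


hits, misses = simulate_hits(n_requests=1000, n_prompts=50)
print(f"hits={hits} misses={misses} hit rate={hits / 1000:.1%}")
```

Misses can never exceed the pool size, so over 1000 requests the hit rate on the system prompt prefix is at least 95% regardless of which user sends which request.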

Your middleware layer handles user identity, API key validation, rate limiting, and usage tracking. vLLM never needs to know about users.
Middleware architecture
User request → [Your Middleware]
  1. Validate API key → look up user metadata
  2. Map user → system_prompt_id (from pool of 50)
  3. Force-inject system prompt → strip any user-provided system role
  4. Track: tokens used, latency, cache hit rate, cost
  5. Rate limit per key
→ [vLLM server: prefix cache is global, unaware of users]
Middleware implementation (FastAPI)
    import time

    from fastapi import FastAPI, Header, HTTPException
    from openai import AsyncOpenAI

    app = FastAPI()
    vllm_client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    SYSTEM_PROMPT_POOL = {
        "tier_free": "You are a helpful assistant...",
        "tier_pro": "You are an expert analyst...",
        "tenant_acme": "You are ACME Corp's internal assistant...",
    }

    # rate_limit is stored per key but not enforced in this snippet
    API_KEY_REGISTRY = {
        "sk-user-abc": {"tenant": "tenant_acme", "rate_limit": 100},
        "sk-user-xyz": {"tenant": "tier_pro", "rate_limit": 500},
    }

    USAGE_LOG = {}

    @app.post("/v1/chat/completions")
    async def proxy(request: dict, authorization: str = Header(...)):
        api_key = authorization.removeprefix("Bearer ").strip()
        if api_key not in API_KEY_REGISTRY:
            raise HTTPException(status_code=401, detail="Invalid API key")

        user_meta = API_KEY_REGISTRY[api_key]
        system_prompt = SYSTEM_PROMPT_POOL[user_meta["tenant"]]

        # Force system prompt: drop any system message the user supplied
        messages = [
            {"role": "system", "content": system_prompt},
            *[m for m in request.get("messages", []) if m["role"] != "system"],
        ]

        t0 = time.monotonic()
        response = await vllm_client.chat.completions.create(
            model=request.get("model", "meta-llama/Llama-3.1-8B-Instruct"),
            messages=messages,
            max_tokens=request.get("max_tokens", 512),
        )
        latency = time.monotonic() - t0

        # Track usage per API key
        USAGE_LOG.setdefault(api_key, []).append({
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "latency_ms": round(latency * 1000),
            "system_prompt_id": user_meta["tenant"],
        })
        # Return a plain dict so FastAPI serializes it as JSON
        return response.model_dump()
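The registry stores a rate_limit per key, and the architecture lists rate limiting as a middleware step. One way to enforce it is a sliding-window counter like the sketch below. The class name is hypothetical, and treating rate_limit as "requests per 60 seconds" is an assumption, since the notes do not state the unit. The state lives in process memory, so a multi-worker deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` calls per key within the last `window_s` seconds."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._calls = defaultdict(deque)   # key -> timestamps of accepted calls

    def allow(self, key: str, limit: int, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        calls = self._calls[key]
        # Drop timestamps that have fallen out of the window
        while calls and now - calls[0] >= self.window_s:
            calls.popleft()
        if len(calls) >= limit:
            return False                   # over the limit: reject, do not record
        calls.append(now)
        return True


limiter = SlidingWindowLimiter(window_s=60.0)
# Explicit timestamps make the behaviour easy to follow
results = [limiter.allow("sk-user-abc", limit=3, now=t) for t in (0, 1, 2, 3, 61)]
print(results)  # [True, True, True, False, True]
```

In the proxy handler this check would sit right after the API key lookup, returning HTTP 429 when allow() is False.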
Cache warmup on startup
With only 50 prompts, pre-warm all of them at server boot so the first real user never pays the cache-miss cost.
    async def warm_prefix_cache():
        # Reuses vllm_client and SYSTEM_PROMPT_POOL from the middleware above
        for tenant_id, prompt in SYSTEM_PROMPT_POOL.items():
            await vllm_client.chat.completions.create(
                model="meta-llama/Llama-3.1-8B-Instruct",
                messages=[
                    {"role": "system", "content": prompt},
                    {"role": "user", "content": "hello"},
                ],
                max_tokens=1,  # don't waste output tokens
            )
            print(f"Warmed cache for: {tenant_id}")
Eviction: will your pool fit in VRAM?
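vLLM uses LRU eviction. If your 50 system prompts fit entirely in the allocated KV cache VRAM, you get a near-100% hit rate on the system prompt prefix with no thrashing. Use the calculator above to verify. For Llama 3.1 8B in FP16, a 512-token prompt takes 64 MiB (128 KiB per token), so the full pool of 50 needs about 3.1 GiB. That is manageable, but it is a real share of the KV budget and has to coexist with the blocks of in-flight requests, so compare it against the KV cache capacity vLLM reports at startup. An FP8 KV cache halves the figure.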
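The difference between "fits" and "almost fits" is sharper than it sounds. The sketch below models LRU at whole-prompt granularity, which is a simplification (vLLM evicts individual blocks), with capacity measured in "number of prompts that fit". The function name is illustrative.

```python
from collections import OrderedDict


def lru_hit_rate(requests: list, capacity: int) -> float:
    """Fraction of requests whose prompt id is resident in an LRU cache."""
    cache = OrderedDict()   # prompt id -> None, oldest entry first
    hits = 0
    for prompt_id in requests:
        if prompt_id in cache:
            hits += 1
            cache.move_to_end(prompt_id)      # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)     # evict the least recently used
            cache[prompt_id] = None
    return hits / len(requests)


# 1000 requests cycling through 50 prompts in order (worst case for LRU)
round_robin = [i % 50 for i in range(1000)]
print(lru_hit_rate(round_robin, capacity=50))  # 0.95
print(lru_hit_rate(round_robin, capacity=49))  # 0.0
```

With room for all 50 prompts only the first pass misses. With room for 49, this cyclic access pattern evicts each prompt just before it is needed again, so every request misses. Real traffic is less adversarial, but the sizing margin matters.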
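| Model | Layers | KV heads | Head dim | Attn type | Per-token cache (FP16) |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | 32 | 8 | 128 | GQA | 128 KiB |
| Llama 3.1 70B | 80 | 8 | 128 | GQA | 320 KiB |
| Qwen3 32B | 64 | 8 | 128 | GQA | 256 KiB |
| Mistral 7B | 32 | 8 | 128 | GQA | 128 KiB |
| GPT-style MHA model | 32 | 32 | 128 | MHA | 512 KiB (4× the GQA equivalent) |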
GQA (Grouped Query Attention) is why modern models are cache-efficient. Llama 3.1 70B has 64 query heads but only 8 KV heads — the KV cache is 8× smaller than a full MHA model of the same size. Always check num_key_value_heads in the model card / config.json.
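The check against config.json can be automated. The sketch below reads the usual Hugging Face config keys (num_hidden_layers, num_attention_heads, num_key_value_heads, hidden_size, and head_dim where present). The function name is made up, and the Llama 3.1 70B values are typed in by hand rather than loaded from the hub.

```python
def kv_bytes_per_token_from_config(config: dict, dtype_bytes: int = 2) -> int:
    """Per-token KV cache size from a Hugging Face style config dict."""
    layers = config["num_hidden_layers"]
    q_heads = config["num_attention_heads"]
    # Models without GQA omit num_key_value_heads: KV heads equal query heads
    kv_heads = config.get("num_key_value_heads", q_heads)
    # Some configs state head_dim explicitly; otherwise derive it
    head_dim = config.get("head_dim") or config["hidden_size"] // q_heads
    return 2 * layers * kv_heads * head_dim * dtype_bytes


llama_70b = {
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "hidden_size": 8192,
}
# Same shape with the KV-head entry removed, i.e. treated as full MHA
mha_same_shape = {k: v for k, v in llama_70b.items() if k != "num_key_value_heads"}

gqa = kv_bytes_per_token_from_config(llama_70b)
mha = kv_bytes_per_token_from_config(mha_same_shape)
print(gqa // 1024, mha // 1024, mha // gqa)  # 320 2560 8
```

The 320 KiB versus 2560 KiB gap is the 8× factor: plugging query heads into the formula overstates the cache by exactly the ratio of query heads to KV heads.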
| Config flag | What it does | When to use |
| --- | --- | --- |
| --enable-prefix-caching | Activates APC globally on the server | Always; no downside unless you have zero prefix reuse |
| --enable-chunked-prefill | Prefills long prompts in chunks, improves TTFT | Long context or high concurrency workloads |
| kv_cache_dtype="fp8" | Halves KV VRAM usage | Hopper/Ada GPUs, tight VRAM budgets |
| gpu_memory_utilization=0.90 | Reserves 90% VRAM for model + KV cache | Default; lower if OOM during warmup |
| --tensor-parallel-size N | Shards model + KV cache across N GPUs | Models that don't fit on a single GPU |
| max_model_len | Caps max context length (reduces KV pool size) | When you don't need full context but want more cache space for concurrency |
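The table mixes Python keyword names and CLI flags for the same settings. The sketch below keeps one dictionary of engine arguments and derives the server flags from it. The specific values (tensor_parallel_size=1, max_model_len=8192) are illustrative, and the helper function is hypothetical. Defaults and supported options vary between vLLM versions, so check the arguments against the version you run.

```python
# One source of truth for the settings in the table above (values are examples)
ENGINE_KWARGS = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "enable_prefix_caching": True,
    "enable_chunked_prefill": True,
    "kv_cache_dtype": "fp8",
    "gpu_memory_utilization": 0.90,
    "tensor_parallel_size": 1,
    "max_model_len": 8192,
}


def to_cli_flags(kwargs: dict) -> list:
    """Translate keyword arguments into `--flag value` form for the server."""
    flags = []
    for name, value in kwargs.items():
        if name == "model":
            continue                      # passed positionally to `vllm serve`
        flag = "--" + name.replace("_", "-")
        if value is True:
            flags.append(flag)            # boolean switches take no value
        elif value is not False:
            flags += [flag, str(value)]
    return flags


# Offline use would be: from vllm import LLM; llm = LLM(**ENGINE_KWARGS)
print("vllm serve", ENGINE_KWARGS["model"], " ".join(to_cli_flags(ENGINE_KWARGS)))
```

Keeping the offline and server configurations in one place avoids the two drifting apart, which matters because a different max_model_len or KV dtype changes how many prompt prefixes fit in the cache.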