01 — Core concept
What is Automatic Prefix Caching (APC)?
vLLM's KV cache stores computed Key-Value pairs from attention layers so they don't need to be recomputed for tokens already seen.
APC extends this: it hashes token sequences (in 16-token blocks via PagedAttention) and reuses KV entries across entirely different requests — as long as they share a common prefix.
You don't "add" KV cache manually. You enable prefix caching and structure your prompts so stable parts always come first.
Key insight: vLLM hashes the raw token sequence. Two requests sharing the same first N tokens = cache hit on those N tokens. The cache is global — shared across all users hitting the same server instance.
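To make the hashing idea concrete, here is a minimal sketch of chained block hashing. It is not vLLM's actual implementation: `BLOCK_SIZE`, `block_hashes`, and `cached_prefix_tokens` are hypothetical names, and SHA-256 stands in for whatever hash the engine uses internally. Each block's hash covers its own tokens plus the previous block's hash, so a block can only match when everything before it matches too.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block, matching the 16-token blocks described above


def block_hashes(token_ids: list) -> list:
    """Hash each full block, chaining in the previous block's hash."""
    hashes = []
    parent = ""
    # Assumption for this sketch: a partial trailing block is not hashed.
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(digest)
        parent = digest
    return hashes


def cached_prefix_tokens(cache: set, token_ids: list) -> int:
    """Count how many leading tokens are covered by already-cached blocks."""
    hits = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break  # the first miss ends the reusable prefix
        hits += 1
    return hits * BLOCK_SIZE


system = list(range(40))             # stand-in for 40 system-prompt token IDs
req_a = system + [900, 901, 902]     # first request
req_b = system + [777, 778]          # different user, same system prompt

cache = set(block_hashes(req_a))     # request A warms the cache
reused = cached_prefix_tokens(cache, req_b)
print(reused)  # 32
```

In this sketch only full blocks are reusable, so 32 of the 40 shared tokens count as hits and the remaining 8 sit in a partial block. Changing even the first token (for example `[999] + system`) yields zero reusable tokens, which is why stable content has to come first.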
Prompt ordering rule — stable content must come first
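System prompt → Few-shot examples → Fixed query component → User query (changes) → Dynamic data
Stable parts (system prompt, few-shot examples, fixed query component): cached after the first request
Changing parts (user query, dynamic data): computed fresh every time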
02 — KV Cache VRAM formula
Your original formula was almost right. The correct form is 2 × L × H_kv × D_head × T × dtype_bytes, where L is the number of layers, H_kv the number of KV heads, D_head the head dimension, and T the number of tokens. The "2 bytes" was the dtype size for FP16: not wrong, just one factor (the leading 2 counts K and V). The critical distinction is using KV heads specifically (e.g. 8 for Llama 3.1 70B) rather than total attention heads (64). There, GQA makes the cache 8× smaller than a naive calculation, so always check num_key_value_heads in the model config.
Full pool formula (N system prompts):
Total = 2 × L × H_kv × D_head × T_avg × dtype_bytes × N_prompts
This tells you the VRAM needed to hold all N system prompt prefixes simultaneously — i.e. zero-eviction scenario.
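As a worked example of the two formulas, here is a small sketch. The function names `kv_bytes_per_token` and `pool_bytes` are made up for illustration; the model numbers are the Llama 3.1 8B values from the parameter table in section 05.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """2 (K and V) x layers x KV heads x head dim x bytes per element."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def pool_bytes(per_token: int, avg_prompt_tokens: int, n_prompts: int) -> int:
    """VRAM needed to keep every system prompt prefix resident at once."""
    return per_token * avg_prompt_tokens * n_prompts


# Llama 3.1 8B: 32 layers, 8 KV heads, head dim 128, FP16 (2 bytes)
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
total = pool_bytes(per_token, avg_prompt_tokens=512, n_prompts=50)

print(f"{per_token / 1024:.0f} KiB per token")     # 128 KiB per token
print(f"{total / 1024**3:.3f} GiB for the pool")   # 3.125 GiB for the pool
```

Passing `dtype_bytes=1` models an FP8 KV cache and halves both numbers.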
03 — Your two scenarios
Scenario A — Fixed system prompt + changing text
Ideal for prefix caching. After the first request warms the cache, every subsequent request skips computing the system prompt tokens entirely — only the user query portion requires fresh compute.
Enable with: enable_prefix_caching=True on your LLM instance, or --enable-prefix-caching on the server.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
    gpu_memory_utilization=0.90,
)

# System prompt must be byte-for-byte identical on every call
SYSTEM = "You are a manufacturing AI..."

def infer(user_query: str):
    # Stable system prompt first, changing user query last.
    # The role markers below are placeholders; match your model's chat template.
    prompt = f"<|system|>\n{SYSTEM}\n<|user|>\n{user_query}\n<|assistant|>\n"
    return llm.generate([prompt], SamplingParams(max_tokens=512))
Scenario B — Fixed system prompt + fixed visual query + changing images
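Partially cacheable. The text prefix (system prompt + fixed query string) is cached normally, provided it comes before the image in the prompt. The image embeddings are not prefix-cached because they're position-dependent visual tokens recomputed per image, and since matching is prefix-based, any tokens that follow a changed image are computed fresh as well.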
| Component | Cached? | Reason |
| System prompt tokens | Yes | Pure text prefix — hashed and reused |
| Fixed query string ("What's happening…") | Yes | Part of stable text prefix |
| Image visual embeddings | No | Computed fresh per image, position-dependent |
| Fixed reference image (if always at position 0) | Partial | Model-arch dependent — some vLLM versions cache fixed-position image tokens |
Tip: If you have a fixed reference image (e.g. a CAD drawing or defect template), always pass it at position 0 in the image list. Some vision models + vLLM versions will cache its visual tokens on repeated use.
04 — Multi-tenant architecture (1000 req / 50 system prompts)
How the global cache behaves with multiple users
vLLM's prefix cache is server-global and user-agnostic. It caches based purely on token hash matches. With 1000 requests and a pool of 50 system prompts, once all 50 are warmed (after the first ~50 unique requests across the pool), every subsequent request gets a cache hit on its system prompt prefix.
Your middleware layer handles user identity, API key validation, rate limiting, and usage tracking. vLLM never needs to know about users.
Middleware architecture
User request → [Your Middleware]
    1. Validate API key → look up user metadata
    2. Map user → system_prompt_id (from pool of 50)
    3. Force-inject system prompt → strip any user-provided system role
    4. Track: tokens used, latency, cache hit rate, cost
    5. Rate limit per key
→ [vLLM server: prefix cache is global, unaware of users]
Middleware implementation (FastAPI)
import time

from fastapi import FastAPI, Header, HTTPException
from openai import AsyncOpenAI

app = FastAPI()
vllm_client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Pool of immutable system prompts, keyed by prompt ID
SYSTEM_PROMPT_POOL = {
    "tier_free": "You are a helpful assistant...",
    "tier_pro": "You are an expert analyst...",
    "tenant_acme": "You are ACME Corp's internal assistant...",
}

# API key -> tenant (system prompt ID) and rate limit
API_KEY_REGISTRY = {
    "sk-user-abc": {"tenant": "tenant_acme", "rate_limit": 100},
    "sk-user-xyz": {"tenant": "tier_pro", "rate_limit": 500},
}

USAGE_LOG = {}  # in-memory usage records per API key

@app.post("/v1/chat/completions")
async def proxy(request: dict, authorization: str = Header(...)):
    api_key = authorization.removeprefix("Bearer ")
    if api_key not in API_KEY_REGISTRY:
        raise HTTPException(status_code=401, detail="Invalid API key")

    user_meta = API_KEY_REGISTRY[api_key]
    system_prompt = SYSTEM_PROMPT_POOL[user_meta["tenant"]]

    # Force system prompt: drop any system message the user sent
    messages = [
        {"role": "system", "content": system_prompt},
        *[m for m in request["messages"] if m["role"] != "system"],
    ]

    t0 = time.monotonic()
    response = await vllm_client.chat.completions.create(
        model=request.get("model", "meta-llama/Llama-3.1-8B-Instruct"),
        messages=messages,
        max_tokens=request.get("max_tokens", 512),
    )
    latency = time.monotonic() - t0

    # Track usage per API key
    USAGE_LOG.setdefault(api_key, []).append({
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "latency_ms": round(latency * 1000),
        "system_prompt_id": user_meta["tenant"],
    })
    return response
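The proxy above stores a `rate_limit` per key but never enforces it (step 5 of the architecture). One way to fill that gap is a sliding-window limiter like the sketch below. It rests on two assumptions: `rate_limit` means requests per 60 seconds (the registry doesn't state a unit), and state lives in a single process. `SlidingWindowLimiter` is a hypothetical helper, not part of FastAPI or vLLM.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window_s` seconds for each key."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._calls = defaultdict(deque)  # key -> timestamps of accepted calls

    def allow(self, key: str, limit: int, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        calls = self._calls[key]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_s:
            calls.popleft()
        if len(calls) >= limit:
            return False
        calls.append(now)
        return True


limiter = SlidingWindowLimiter(window_s=60.0)
# Explicit timestamps make the behaviour easy to follow: limit of 2 per minute.
results = [limiter.allow("sk-user-abc", limit=2, now=t) for t in (0.0, 1.0, 2.0, 61.0)]
print(results)  # [True, True, False, True]
```

In the proxy, this would sit right after the API key check: call `limiter.allow(api_key, user_meta["rate_limit"])` and raise `HTTPException(status_code=429)` when it returns `False`. A multi-worker deployment would need shared state instead of this in-process dict.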
Cache warmup on startup
With only 50 prompts, pre-warm all of them at server boot so the first real user never pays the cache-miss cost.
async def warm_prefix_cache():
    # One tiny request per system prompt so its KV blocks are cached before real traffic
    for tenant_id, prompt in SYSTEM_PROMPT_POOL.items():
        await vllm_client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": "hello"},
            ],
            max_tokens=1,  # don't waste output tokens
        )
        print(f"Warmed cache for: {tenant_id}")
Eviction: will your pool fit in VRAM?
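vLLM uses LRU eviction. If your 50 system prompts fit entirely in the allocated KV cache VRAM, you get a near-100% hit rate with zero thrashing. Use the formula in section 02 to verify. For example, Llama 3.1 8B in FP16 (32 layers, 8 KV heads, head dim 128) needs 128 KB per token, so a pool of 50 prompts at 512 tokens each takes about 3.1 GB. That is comfortable on a GPU with tens of GB free after loading weights, but on a 24 GB card serving an FP16 8B model it is a large share of the KV budget, so size it explicitly rather than assuming it is negligible.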
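The difference between "pool fits" and "pool doesn't fit" can be seen in a toy simulation. This sketch treats each system prompt as one cache entry and replays a strictly cyclic access pattern, which is a worst case for LRU; real vLLM evicts at block granularity and real traffic is less regular. `lru_hit_rate` is a hypothetical helper.

```python
from collections import OrderedDict


def lru_hit_rate(capacity: int, accesses: list) -> float:
    """Replay prefix accesses through an LRU cache holding `capacity` prefixes."""
    cache = OrderedDict()
    hits = 0
    for prefix_id in accesses:
        if prefix_id in cache:
            hits += 1
            cache.move_to_end(prefix_id)       # mark as most recently used
        else:
            cache[prefix_id] = True
            if len(cache) > capacity:
                cache.popitem(last=False)      # evict the least recently used
    return hits / len(accesses)


# 1000 requests cycling through a pool of 50 system prompts
accesses = [i % 50 for i in range(1000)]

fits = lru_hit_rate(capacity=50, accesses=accesses)        # whole pool resident
too_small = lru_hit_rate(capacity=40, accesses=accesses)   # pool exceeds cache

print(fits)       # 0.95
print(too_small)  # 0.0
```

With room for all 50 prefixes, only the first 50 requests miss. With room for 40, the cyclic pattern evicts each prefix just before it is needed again, so nothing ever hits.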
05 — Model parameter reference
| Model | Layers | KV heads | Head dim | Attn type | ~Per-token cache (FP16) |
| Llama 3.1 8B | 32 | 8 | 128 | GQA | ~128 KB |
| Llama 3.1 70B | 80 | 8 | 128 | GQA | ~320 KB |
| Qwen3 32B | 64 | 8 | 128 | GQA | ~256 KB |
| Mistral 7B | 32 | 8 | 128 | GQA | ~128 KB |
| GPT-style MHA model | 32 | 32 | 128 | MHA | ~512 KB (4× the 8-KV-head GQA equivalent) |
GQA (Grouped Query Attention) is why modern models are cache-efficient. Llama 3.1 70B has 64 query heads but only 8 KV heads — the KV cache is 8× smaller than a full MHA model of the same size. Always check num_key_value_heads in the model card / config.json.
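Checking the config can be scripted. The sketch below reads Hugging Face-style `config.json` fields (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `hidden_size`, `head_dim`) from a plain dict; `kv_cache_stats` is a hypothetical helper, and the fallbacks for missing keys are assumptions that hold for common decoder-only configs but not necessarily every architecture.

```python
def kv_cache_stats(config: dict, dtype_bytes: int = 2) -> dict:
    """Derive per-token KV cache size from Hugging Face-style config fields."""
    q_heads = config["num_attention_heads"]
    # MHA configs may omit num_key_value_heads; fall back to the query head count.
    kv_heads = config.get("num_key_value_heads", q_heads)
    # Some configs omit head_dim; derive it from the hidden size.
    head_dim = config.get("head_dim", config["hidden_size"] // q_heads)
    layers = config["num_hidden_layers"]
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return {
        "bytes_per_token": per_token,
        "gqa_reduction": q_heads // kv_heads,  # how much smaller than full MHA
    }


# Llama 3.1 70B values: 80 layers, 64 query heads, 8 KV heads, hidden size 8192
llama_70b = {
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "hidden_size": 8192,
}
stats = kv_cache_stats(llama_70b)
print(stats)  # {'bytes_per_token': 327680, 'gqa_reduction': 8}
```

In practice the dict would come from `json.load` on the model's `config.json`.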
06 — Points to watch in production
- System prompt must be byte-for-byte identical. The cache key is a hash of the token sequence. A single trailing space, a date injection (f"Today is {datetime.now().date()}"), or a different encoding changes the tokens and breaks the hash from that point onward; when the change sits near the top of the prompt, that is effectively a full cache miss every time. Treat prompts from your pool as immutable constants and version them with IDs.
- Chat template consistency. Special tokens added by the chat template (BOS, EOS, role markers) are part of the hashed sequence. If different clients use different templates for the same model, they'll never share cache hits, even with identical content strings.
- FP8 KV cache halves VRAM, minimal quality loss. Use kv_cache_dtype="fp8" to double your effective cache capacity. Requires CUDA 11.8+ and Hopper/Ada GPUs for native hardware support. On older GPUs it falls back to software emulation.
- Chunked prefill + prefix caching work together. Long system prompts get prefilled in chunks but their KV blocks are still cached for future reuse. Enable both: --enable-prefix-caching --enable-chunked-prefill.
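- Monitor cache hit rate via Prometheus. Scrape vllm:gpu_prefix_cache_hit_rate from the server's /metrics endpoint (metric names vary across vLLM versions, so check what your build exposes). If the hit rate drops below ~80% on a warm server, you likely have prompt stability issues: dynamic content being injected into what should be fixed prompts.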
- LRU eviction can thrash with too many unique long prefixes. A pool of 50 is safe. If you scale to 500+ unique system prompts with long token lengths, calculate your VRAM budget carefully, because hot prompts will evict cold ones repeatedly under load.
- Multimodal: text prefix is cached, image tokens are not. Don't assume vision requests benefit from prefix caching on the image side. Only the stable text prefix (system prompt + fixed query string) gets the speedup.
- Tensor parallelism splits KV cache across GPUs. With --tensor-parallel-size 4, each GPU holds 1/4 of the KV cache. Your VRAM calculation per GPU = total_KV / TP_size. Logical cache hit rate stays the same.
- LMCache for cross-instance sharing (advanced). vLLM's built-in prefix cache lives in a single server's VRAM. Multiple vLLM instances behind a load balancer each have independent caches. LMCache adds a shared CPU/disk-backed layer so hits transfer across instances, which is useful for fleet deployments.
- Never inject per-user IDs or timestamps into the system prompt. Even if it seems harmless, a dynamic value invalidates the cache from the injection point onward, and in a system prompt at the top of the sequence that turns every request into a cache miss. Track user identity entirely in your middleware layer, never in the prompt itself.
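The "immutable constants, versioned with IDs" advice from the first point can be enforced mechanically. Below is a minimal sketch that pins a SHA-256 fingerprint per prompt ID and rejects anything that isn't byte-identical. `PROMPTS`, `PINNED`, `fingerprint`, and `check_prompt` are hypothetical names, and the prompt text is the placeholder string from the middleware example.

```python
import hashlib


def fingerprint(text: str) -> str:
    """SHA-256 over the exact UTF-8 bytes, so any whitespace change shows up."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical registry: versioned prompt ID -> prompt text
PROMPTS = {"tenant_acme_v1": "You are ACME Corp's internal assistant..."}

# Fingerprints recorded once, when a prompt version is released
PINNED = {prompt_id: fingerprint(text) for prompt_id, text in PROMPTS.items()}


def check_prompt(prompt_id: str, text: str) -> bool:
    """Return True only if `text` is byte-identical to the pinned version."""
    return fingerprint(text) == PINNED[prompt_id]


ok = check_prompt("tenant_acme_v1", PROMPTS["tenant_acme_v1"])
drifted = check_prompt("tenant_acme_v1", PROMPTS["tenant_acme_v1"] + " ")

print(ok)       # True
print(drifted)  # False
```

Running a check like this in the middleware before forwarding (or in CI against the prompt files) catches trailing spaces and accidental injections before they show up as a falling hit rate. Note it compares text, not tokens, so it does not catch chat template differences.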
07 — Quick reference flags
| Config flag | What it does | When to use |
| --enable-prefix-caching | Activates APC globally on the server | Always; no downside unless you have zero prefix reuse |
| --enable-chunked-prefill | Prefills long prompts in chunks, improves TTFT | Long context or high concurrency workloads |
| kv_cache_dtype="fp8" | Halves KV VRAM usage | Hopper/Ada GPUs, tight VRAM budgets |
| gpu_memory_utilization=0.90 | Reserves 90% VRAM for model + KV cache | Default; lower if OOM during warmup |
| --tensor-parallel-size N | Shards model + KV cache across N GPUs | Models that don't fit on a single GPU |
| max_model_len | Caps max context length (reduces KV pool size) | When you don't need full context but want more cache space for concurrency |