vLLM KV Cache — Field Notes

01 — Core concept

What is Automatic Prefix Caching (APC)?

vLLM's KV cache stores computed Key-Value pairs from attention layers so they don't need to be recomputed for tokens already seen. APC extends this: it hashes token sequences (in 16-token blocks via PagedAttention) and reuses KV entries across entirely different requests — as long as they share a common prefix.

You don't "add" KV cache manually. You enable prefix caching and structure your prompts so stable parts always come first.

Key insight: vLLM hashes the raw token sequence, block by block. Two requests sharing the same first N tokens get a cache hit on those tokens, rounded down to whole 16-token blocks. The cache is global: it is shared across all users hitting the same server instance.
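
To make the block-level matching concrete, here is a simplified sketch of chained block hashing. It is not vLLM's actual implementation: the function names, the use of SHA-256, and the plain set standing in for the cache are all assumptions chosen for illustration. Only the 16-token block size comes from the notes above.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 16  # tokens per KV block, as described above


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash per full block; a trailing partial block is skipped."""
    hashes: List[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        # Each hash covers the parent hash too, so it identifies the whole prefix.
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(cache: Set[str], token_ids: List[int]) -> int:
    """Count leading tokens covered by blocks that are already in the cache."""
    hit_blocks = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hit_blocks += 1
    return hit_blocks * BLOCK_SIZE


system = list(range(40))            # stand-in for 40 system-prompt token ids
req_a = system + [900, 901, 902]    # first request
req_b = system + [700, 701]         # second request, same prefix

cache = set(block_hashes(req_a))    # request A populates the cache
print(cached_prefix_tokens(cache, req_b))  # 32: two full blocks of the shared 40 tokens
```

Of the 40 shared tokens only 32 are reused, because the third block is incomplete and mixes in request-specific tokens.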

Prompt ordering rule — stable content must come first

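System prompt → Few-shot examples → Fixed query component → User query (changes) → Dynamic data

The stable leading components (system prompt, few-shot examples, fixed query component) are cached after the first request; the user query and dynamic data are computed fresh every time.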
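
The ordering rule can be sketched as a small message builder. This is only an illustration: build_messages and its parameter names are hypothetical, and the messages use the OpenAI-style role/content shape that appears later in these notes.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_messages(system_prompt: str, few_shot: List[Message], fixed_query: str,
                   user_query: str, dynamic_data: str = "") -> List[Message]:
    """Place stable content first so every request shares the longest possible prefix."""
    stable = [{"role": "system", "content": system_prompt}, *few_shot]
    # Inside the final user turn, the fixed query still precedes the changing parts.
    tail = "\n\n".join(part for part in (fixed_query, user_query, dynamic_data) if part)
    return stable + [{"role": "user", "content": tail}]


FEW_SHOT = [
    {"role": "user", "content": "Example input"},
    {"role": "assistant", "content": "Example output"},
]
a = build_messages("You are a QA inspector.", FEW_SHOT, "Classify the defect.", "Scratch on panel 3")
b = build_messages("You are a QA inspector.", FEW_SHOT, "Classify the defect.", "Dent near hinge")
print(a[:-1] == b[:-1])  # True: everything before the final turn is identical
```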

02 — KV Cache VRAM formula

KV cache (bytes) = 2 × L × H_kv × D_head × T × dtype_bytes

2: Key + Value, both must be stored
L: Number of transformer layers (e.g. 32 for 8B, 80 for 70B)
H_kv: Number of KV heads. Use GQA heads if the model uses GQA, NOT query heads!
D_head: Head dimension = hidden_size / num_attention_heads (commonly 128)
T: Token count, i.e. your system prompt length in tokens
dtype_bytes: 2 for FP16/BF16 · 1 for FP8 · 4 for FP32

Your original formula was almost right. The correct form is 2 × L × H_kv × D_head × T × dtype_bytes. The "2 bytes" was the dtype size for FP16: not wrong, just one factor. The critical distinction is using KV heads specifically (8 for Llama 3.1 70B) rather than total attention heads (64). There, GQA makes the cache 8× smaller than a naive calculation; for Llama 3.1 8B (32 query heads, 8 KV heads) the saving is 4×. Always check num_key_value_heads in the model config.

Full pool formula (N system prompts):
Total = 2 × L × H_kv × D_head × T_avg × dtype_bytes × N_prompts

This tells you the VRAM needed to hold all N system prompt prefixes simultaneously — i.e. zero-eviction scenario.

Interactive VRAM calculator

Default inputs: Layers (L) 32 · KV heads (H_kv) 8 · Head dim (D_head) 128 · Tokens per prompt (T) 512 · Number of system prompts 50, plus model preset and dtype selectors.
Outputs: per token (KB) · per system prompt (MB) · all prompts in the pool (MB total).
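
The arithmetic behind the formulas in this section can be reproduced in a few lines. This sketch uses the default inputs listed for the calculator and assumes FP16 (2 bytes per value), since the calculator's dtype setting is not recorded here. The function name kv_cache_bytes is illustrative.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, dtype_bytes: int = 2) -> int:
    """KV cache size in bytes: 2 x L x H_kv x D_head x T x dtype_bytes."""
    return 2 * layers * kv_heads * head_dim * tokens * dtype_bytes


# Defaults: L=32, H_kv=8, D_head=128, T=512, 50 prompts, FP16 assumed.
per_token = kv_cache_bytes(32, 8, 128, tokens=1)
per_prompt = kv_cache_bytes(32, 8, 128, tokens=512)
pool = per_prompt * 50

print(per_token / 2**10)   # 128.0  (KiB per token)
print(per_prompt / 2**20)  # 64.0   (MiB per 512-token system prompt)
print(pool / 2**30)        # 3.125  (GiB for all 50 prompts)
```

Passing dtype_bytes=1 models an FP8 KV cache and halves each figure.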

03 — Your two scenarios

Scenario A — Fixed system prompt + changing text

Ideal for prefix caching. After the first request warms the cache, every subsequent request skips computing the system prompt tokens entirely — only the user query portion requires fresh compute.

Enable with: enable_prefix_caching=True on your LLM instance, or --enable-prefix-caching on the server.

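from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
    gpu_memory_utilization=0.90,
)

# System prompt must be byte-for-byte identical every call
SYSTEM = "You are a manufacturing AI..."

def infer(user_query: str):
    # llm.chat applies the model's own chat template, so the role markers
    # (and therefore the cached prefix tokens) are identical on every call.
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user_query},
    ]
    return llm.chat(messages, SamplingParams(max_tokens=512))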

Scenario B — Fixed system prompt + fixed visual query + changing images

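Partially cacheable. The text prefix (system prompt + fixed query string) is cached normally, provided it comes before the image in the token sequence; anything placed after a changing image falls outside the shared prefix. The image's visual tokens are not reused, because each new image produces different embeddings that must be computed fresh.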

Component	Cached?	Reason
System prompt tokens	Yes	Pure text prefix — hashed and reused
Fixed query string ("What's happening…")	Yes	Part of stable text prefix
Image visual embeddings	No	Computed fresh per image, position-dependent
Fixed reference image (if always at position 0)	Partial	Model-arch dependent — some vLLM versions cache fixed-position image tokens

Tip: If you have a fixed reference image (e.g. a CAD drawing or defect template), always pass it at position 0 in the image list. Some vision models + vLLM versions will cache its visual tokens on repeated use.

04 — Multi-tenant architecture (1000 req / 50 system prompts)

How the global cache behaves with multiple users

vLLM's prefix cache is server-global and user-agnostic. It caches based purely on token hash matches. With 1000 requests and a pool of 50 system prompts, once all 50 are warmed (after the first ~50 unique requests across the pool), every subsequent request gets a cache hit on its system prompt prefix.

Your middleware layer handles user identity, API key validation, rate limiting, and usage tracking. vLLM never needs to know about users.

Middleware architecture

User request → [Your Middleware]
Validate API key  →  look up user metadata
Map user  →  system_prompt_id (from pool of 50)
Force-inject system prompt  →  strip any user-provided system role
Track: tokens used, latency, cache hit rate, cost
Rate limit per key
  →  [vLLM server — prefix cache is global, unaware of users]

Middleware implementation (FastAPI)

from fastapi import FastAPI, Header, HTTPException
from openai import AsyncOpenAI
import time

app = FastAPI()
vllm_client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

SYSTEM_PROMPT_POOL = {
    "tier_free":    "You are a helpful assistant...",
    "tier_pro":     "You are an expert analyst...",
    "tenant_acme":  "You are ACME Corp's internal assistant...",
}

API_KEY_REGISTRY = {
    "sk-user-abc": {"tenant": "tenant_acme", "rate_limit": 100},
    "sk-user-xyz": {"tenant": "tier_pro",    "rate_limit": 500},
}

USAGE_LOG = {}

@app.post("/v1/chat/completions")
async def proxy(request: dict, authorization: str = Header(...)):
    api_key = authorization.replace("Bearer ", "")
    if api_key not in API_KEY_REGISTRY:
        raise HTTPException(status_code=401)

    user_meta = API_KEY_REGISTRY[api_key]
    system_prompt = SYSTEM_PROMPT_POOL[user_meta["tenant"]]

    # Force system prompt — user cannot override it
    messages = [
        {"role": "system", "content": system_prompt},
        *[m for m in request["messages"] if m["role"] != "system"],
    ]

    t0 = time.monotonic()
    response = await vllm_client.chat.completions.create(
        model=request.get("model", "meta-llama/Llama-3.1-8B-Instruct"),
        messages=messages,
        max_tokens=request.get("max_tokens", 512),
    )
    latency = time.monotonic() - t0

    # Track usage per API key
    USAGE_LOG.setdefault(api_key, []).append({
        "prompt_tokens":     response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "latency_ms":        round(latency * 1000),
        "system_prompt_id":  user_meta["tenant"],
    })

    return response
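
The registry above stores a rate_limit per key, but the proxy never enforces it. One way to fill that gap is a sliding-window limiter, sketched below with the standard library only. The class name is hypothetical, and treating rate_limit as requests per 60 seconds is an assumption, since the notes do not state the unit.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """In-memory, per-key request limiter over a sliding time window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, api_key: str, limit: int, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[api_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
results = [limiter.allow("sk-user-abc", limit=2, now=t) for t in (0.0, 1.0, 2.0, 61.0)]
print(results)  # [True, True, False, True]
```

In the proxy this would sit right after the key lookup, for example by raising HTTPException(status_code=429) when limiter.allow(api_key, user_meta["rate_limit"]) returns False. State lives in process memory, so multiple middleware replicas would each need a shared store instead.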

Cache warmup on startup

With only 50 prompts, pre-warm all of them at server boot so the first real user never pays the cache-miss cost.

async def warm_prefix_cache():
    for tenant_id, prompt in SYSTEM_PROMPT_POOL.items():
        await vllm_client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[
                {"role": "system",  "content": prompt},
                {"role": "user",    "content": "hello"},
            ],
            max_tokens=1,    # don't waste output tokens
        )
        print(f"Warmed cache for: {tenant_id}")

Eviction: will your pool fit in VRAM?

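vLLM uses LRU eviction. If your 50 system prompts fit entirely in the allocated KV cache VRAM, you get near-100% hit rate with zero thrashing. Use the formula in section 02 to verify. For Llama 3.1 8B in FP16 (128 KiB per token), a 512-token prompt takes 64 MiB, so the pool of 50 needs about 3.1 GiB. That fits comfortably on 40 GB and 80 GB GPUs, but on a 24 GB card it is a large share of what remains after roughly 16 GB of FP16 weights, so check it against the KV cache size vLLM reports at startup.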
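
The effect of an undersized cache can be illustrated with a toy LRU simulation. This is a deliberate simplification: it tracks whole prompts rather than 16-token blocks, ignores the memory used by in-flight requests, and assumes round-robin traffic, which is the worst case for LRU. The function name is illustrative.

```python
from collections import OrderedDict


def lru_hit_rate(num_prompts: int, capacity: int, requests: int) -> float:
    """Hit rate of an LRU cache holding `capacity` prompts under round-robin traffic."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    hits = 0
    for i in range(requests):
        prompt_id = i % num_prompts
        if prompt_id in cache:
            hits += 1
            cache.move_to_end(prompt_id)   # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used prompt
            cache[prompt_id] = True
    return hits / requests


print(lru_hit_rate(50, 50, 1000))  # 0.95: only the 50 warmup requests miss
print(lru_hit_rate(50, 40, 1000))  # 0.0: every prompt is evicted before it repeats
```

Real traffic is rarely perfectly round-robin, so an undersized cache usually degrades more gradually than this, but the cliff shows why a pool that nearly fits is not the same as one that fits.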

05 — Model parameter reference

Model	Layers	KV heads	Head dim	Attn type	~Per-token cache (FP16)
Llama 3.1 8B	32	8	128	GQA	~128 KiB
Llama 3.1 70B	80	8	128	GQA	~320 KiB
Qwen3 32B	64	8	128	GQA	~256 KiB
Mistral 7B	32	8	128	GQA	~128 KiB
GPT-style MHA model	32	32	128	MHA	~512 KiB (4× the 8-KV-head GQA equivalent)

GQA (Grouped Query Attention) is why modern models are cache-efficient. Llama 3.1 70B has 64 query heads but only 8 KV heads — the KV cache is 8× smaller than a full MHA model of the same size. Always check num_key_value_heads in the model card / config.json.
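
As a sketch of that check, the per-token figure can be derived directly from the fields in a Hugging Face style config.json. The helper name is hypothetical, and the dict below holds only the Llama 3.1 70B values quoted in these notes plus its hidden size of 8192. In practice you would load the real config.json.

```python
def kv_bytes_per_token(config: dict, dtype_bytes: int = 2) -> int:
    """Per-token KV cache size from config.json fields (FP16 by default)."""
    layers = config["num_hidden_layers"]
    q_heads = config["num_attention_heads"]
    # If num_key_value_heads is absent, the model uses plain MHA.
    kv_heads = config.get("num_key_value_heads", q_heads)
    head_dim = config.get("head_dim") or config["hidden_size"] // q_heads
    return 2 * layers * kv_heads * head_dim * dtype_bytes


llama_70b = {
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "hidden_size": 8192,
}
# The same shape without GQA, for comparison.
mha_equivalent = {k: v for k, v in llama_70b.items() if k != "num_key_value_heads"}

gqa = kv_bytes_per_token(llama_70b)
mha = kv_bytes_per_token(mha_equivalent)
print(gqa // 1024)  # 320 (KiB per token with GQA)
print(mha // 1024)  # 2560 (KiB per token if every query head had its own KV head)
print(mha // gqa)   # 8
```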

06 — Points to watch in production

System prompt must be byte-for-byte identical. The cache key is a hash of the raw token sequence. A single trailing space, a date injection (f"Today is {datetime.now().date()}"), or a different encoding breaks the hash — full cache miss every time. Treat prompts from your pool as immutable constants and version them with IDs.
Chat template consistency. Special tokens added by the chat template (BOS, EOS, role markers) are part of the hashed sequence. If different clients use different templates for the same model, they'll never share cache hits even with identical content strings.
FP8 KV cache halves KV VRAM relative to FP16/BF16, usually with minimal quality loss. Use kv_cache_dtype="fp8" to roughly double your effective cache capacity. Requires CUDA 11.8+ and Hopper/Ada GPUs for native hardware support. On older GPUs it falls back to software emulation.
Chunked prefill + prefix caching work together. Long system prompts get prefilled in chunks but their KV blocks are still cached for future reuse. Enable both: --enable-prefix-caching --enable-chunked-prefill.
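Monitor cache hit rate via Prometheus. Expose vllm:gpu_prefix_cache_hit_rate (the metric name varies across vLLM versions, and some releases expose separate prefix-cache query and hit counters instead, so check your server's /metrics output). If the hit rate drops below ~80% on a warm server, you likely have prompt stability issues: dynamic content being injected into what should be fixed prompts.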
LRU eviction can thrash with too many unique long prefixes. A pool of 50 is safe. If you scale to 500+ unique system prompts with long token lengths, calculate your VRAM budget carefully — hot prompts will evict cold ones repeatedly under load.
Multimodal: text prefix is cached, image tokens are not. Don't assume vision requests benefit from prefix caching on the image side. Only the stable text prefix (system prompt + fixed query string) gets the speedup.
Tensor parallelism splits KV cache across GPUs. With --tensor-parallel-size 4, each GPU holds 1/4 of the KV cache. Your VRAM calculation per GPU = total_KV / TP_size. Logical cache hit rate stays the same.
LMCache for cross-instance sharing (advanced). vLLM's built-in prefix cache lives in a single server's VRAM. Multiple vLLM instances behind a load balancer each have independent caches. LMCache adds a shared CPU/disk-backed layer so hits transfer across instances — useful for fleet deployments.
Never inject per-user IDs or timestamps into the system prompt. Even if harmless-seeming, any dynamic injection makes every request a cache miss. Track user identity entirely in your middleware layer — never in the prompt itself.
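
Several of the points above come down to keeping pool prompts immutable and versioned. One way to catch accidental drift is to fingerprint each prompt at load time and verify it before sending, as sketched below. The names and the ID scheme are hypothetical. The check also works on the prompt string rather than on tokens, so it catches edits such as trailing spaces or injected dates but not chat-template differences.

```python
import hashlib
from types import MappingProxyType

# Versioned prompt IDs; the mapping proxy gives a read-only view of the pool.
_PROMPTS = {"tenant_acme:v1": "You are ACME Corp's internal assistant."}
PROMPT_POOL = MappingProxyType(_PROMPTS)


def fingerprint(text: str) -> str:
    """Short SHA-256 fingerprint of the exact UTF-8 bytes of a prompt."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


# Recorded once at startup, then compared before each request is sent.
EXPECTED = {pid: fingerprint(p) for pid, p in PROMPT_POOL.items()}


def check_prompt(prompt_id: str, text: str) -> bool:
    """True only if `text` is byte-for-byte the registered prompt."""
    return fingerprint(text) == EXPECTED[prompt_id]


base = PROMPT_POOL["tenant_acme:v1"]
print(check_prompt("tenant_acme:v1", base))                        # True
print(check_prompt("tenant_acme:v1", base + " "))                  # False: trailing space
print(check_prompt("tenant_acme:v1", base + " Today is Monday."))  # False: injected date
```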

07 — Quick reference flags

Config flag	What it does	When to use
`--enable-prefix-caching`	Activates APC globally on the server	Always — no downside unless you have zero prefix reuse
`--enable-chunked-prefill`	Prefills long prompts in chunks so prefill and decode can share a batch, which smooths inter-token latency	Long context or high concurrency workloads
`kv_cache_dtype="fp8"`	Halves KV VRAM usage	Hopper/Ada GPUs, tight VRAM budgets
`gpu_memory_utilization=0.90`	Reserves 90% VRAM for model + KV cache	Default; lower if OOM during warmup
`--tensor-parallel-size N`	Shards model + KV cache across N GPUs	Models that don't fit on a single GPU
`max_model_len`	Caps max context length, lowering the worst-case KV footprint of a single request	When you don't need full context but want more cache space for concurrency