Internal Engineering Reference · v2.1

LLM / VLM Quantization
& Inference Handbook

Production-grade reference for selecting quantization methods, inference engines, and deployment strategies. Focused on trade-offs and real engineering decisions — not textbook theory.

15 Sections
10 Quantization Methods
7 Inference Engines
Production Deployment Patterns
§ 01
Core Concepts · Foundations of Quantization
What is Quantization?

Quantization is the process of reducing the numerical precision of model weights and/or activations from high-precision floating point (FP32/FP16) to lower-precision formats (INT8, INT4, FP8). The goal is to reduce model memory footprint and accelerate matrix multiply operations — the dominant operation in transformer inference — at the cost of some representational accuracy.

Fundamentally: instead of storing a weight as a 32-bit float, you store it as an 8-bit or 4-bit integer, then scale it back during computation using a stored scale factor.
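The store-then-rescale idea can be sketched in a few lines. This is a minimal illustration assuming symmetric per-tensor INT8 with a single scale factor; the function names and sample weights are hypothetical, and real libraries typically use per-channel or per-group scales.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: map floats to integers in [-128, 127] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers and scale."""
    return [v * scale for v in q]


weights = [0.52, -1.27, 0.003, 0.9, -0.41]   # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

The rounding error per weight is bounded by half the scale, which is why a single large outlier (which inflates the scale) hurts every other weight in the tensor.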

Why Quantization is Required
💾 Memory Wall

A 70B FP16 model needs ~140 GB VRAM. With INT4 it drops to ~35 GB — fits on a single 40GB A100 or 2×RTX 3090s. Without quantization, large models are simply unrunnable on most hardware.

⚡ Throughput Bottleneck

LLM inference is memory-bandwidth bound, not compute bound. Smaller dtypes = less data moved per token = higher throughput. INT8 can be 2× faster than FP16 on bandwidth-saturated hardware.

💰 Cost Reduction

Smaller models fit on cheaper GPUs. A model that required an A100 in FP16 might run on an RTX 4090 in INT4. Cost per token can drop 5–10×. Critical for API-serving businesses.

Core Trade-off Triangle
Lower Precision → | Memory | Latency / Throughput | Accuracy
FP32 | 🔴 4× baseline | 🔴 Slowest | 🟢 Full accuracy
FP16 / BF16 | 🟡 2× baseline | 🟡 ~2× faster | 🟢 Negligible loss
INT8 | 🟢 1× baseline | 🟢 2–3× faster | 🟡 <1% degradation
INT4 | 🟢 0.5× baseline | 🟢 3–4× faster | 🟡 1–3% degradation
Quantization vs Pruning vs Distillation
Technique | What it does | Accuracy Cost | Best For
Quantization | Reduces precision of weights/activations | Low (0.5–3%) | Inference speed, VRAM reduction
Pruning | Removes unimportant weights (sets to 0) | Medium (can be high) | Structured sparsity, specialized hardware
Distillation | Trains smaller model to mimic larger one | Depends on size gap | Permanent smaller model, edge deployment
Combined | QAT + pruning + distillation | Controllable | Maximum compression with accuracy budget
Rule: Quantization is almost always the first optimization to apply. Post-training quantization needs no retraining, leaves the original checkpoint untouched (so you can always fall back to it), and has well-established tooling. Pruning and distillation require retraining.
§ 02
Reference · Types of Quantization
Core Floating Point Formats
FP16
16-bit · Half Precision
IEEE 754 half-precision. 1 sign bit, 5 exponent bits, 10 mantissa bits. Dynamic range: ~6×10⁻⁵ to 65504. The default "quantized" format for most GPU inference today.
Memory vs FP32: 2× reduction
Hardware: All modern GPUs
Accuracy Loss: Negligible
CUDA Tensor Cores: ✅ Fully utilized
Pros
  • No accuracy loss vs FP32
  • Native GPU support everywhere
  • Default for HF Transformers
Cons
  • Can overflow (max 65504)
  • Less stable for training
  • Half of FP32 memory is still large for big models
Production Default · vLLM Native · All NVIDIA GPUs
BF16
16-bit · Brain Float
Google Brain's format. Same 1 sign bit, but 8 exponent bits (same as FP32!) and only 7 mantissa bits. Same dynamic range as FP32 but less precision. Critical: no gradient overflow during training.
Dynamic Range: Same as FP32
Hardware: A100, H100, RTX 3090+
vs FP16: More stable training
Best For: Fine-tuning + inference
Rule: Prefer BF16 over FP16 for fine-tuning (no overflow) and for inference on Ampere+ hardware (A100, H100, RTX 3090+). Use FP16 on older Volta/Turing hardware (V100, T4), which lacks BF16 support.
INT8
8-bit · Integer
8-bit integer quantization. Weights and/or activations are mapped to [-128, 127] using scale factors. Two flavors: W8A8 (weights + activations both INT8, needs calibration) and W8A16 (only weights INT8, activations stay FP16 — simpler, more common).
Memory vs FP16: 2× reduction
Throughput Gain: 1.5–2.5× faster
Accuracy Loss: <1% (W8A16)
Calibration: Needed for W8A8
Pros
  • Strong accuracy preservation
  • Widely supported
  • Excellent for production APIs
Cons
  • W8A8 needs calibration data
  • Some activations hard to quantize
  • Outlier problem (see SmoothQuant)
INT4 / 4-bit
4-bit · Integer
Maps weights to 16 discrete values [-8, 7]. Groups of 64–128 weights share a scale factor (group quantization). Almost always W4A16 (activations stay FP16). The dominant format for consumer GPU inference today.
Memory vs FP16: 4× reduction
Accuracy Loss: 1–3% (model-dependent)
Group Size: g=128 typical
Sweet Spot: 7B–70B models
Warning: Small models (<3B) degrade significantly with INT4. Larger models (70B+) are surprisingly robust. Always benchmark on your task before deploying INT4 in production.
Consumer GPU Sweet Spot · AWQ / GPTQ / GGUF
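Group quantization, as described for INT4 above, can be sketched as follows. This is a simplified illustration assuming symmetric quantization with one FP16 scale per group and no zero-point (production formats such as AWQ and GPTQ usually also store a zero-point); the function names and sample values are hypothetical.

```python
def quantize_groupwise_int4(weights, group_size=128):
    """Symmetric INT4: each group of `group_size` weights gets its own scale."""
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        max_abs = max(abs(w) for w in chunk)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        q = [max(-8, min(7, round(w / scale))) for w in chunk]
        groups.append((q, scale))
    return groups


def effective_bits(bits=4, group_size=128, scale_bits=16):
    """Storage cost per weight once the per-group scale is amortized."""
    return bits + scale_bits / group_size


# Tiny demo with group_size=4: the outlier (4.0) only inflates the scale of its own group.
weights = [0.1, -0.2, 0.05, 0.15, 4.0, -0.3, 0.2, 0.1]
groups = quantize_groupwise_int4(weights, group_size=4)
print(groups)
print(effective_bits(group_size=128), effective_bits(group_size=64))
```

Smaller groups isolate outliers better but cost more scale storage: under these assumptions g=128 costs 4.125 bits per weight and g=64 costs 4.25.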
FP8
8-bit · Float (E4M3 / E5M2)
Two FP8 variants: E4M3 (4 exponent, 3 mantissa — better for weights/activations) and E5M2 (5 exponent, 2 mantissa — better for gradients). Requires H100 or Ada Lovelace (RTX 4090) hardware. Native CUDA support in H100 Transformer Engine.
Memory vs FP16: 2× reduction
vs INT8: Better accuracy
Hardware: H100, RTX 4090 only
vLLM support: ✅ With --quantization fp8
Rule: If you have H100s, FP8 is the best inference format — better accuracy than INT8 at the same memory, with native hardware acceleration. On older hardware, fall back to INT8/INT4.
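The range-versus-precision trade-off between E4M3 and E5M2 follows directly from the bit layout. The sketch below computes the largest finite value of each format; it assumes E5M2 reserves its top exponent code for inf/NaN (IEEE-style) and E4M3 follows the common "FN" convention where only the all-ones bit pattern is NaN. The function name is hypothetical.

```python
def float_max(exp_bits, man_bits, ieee_like=True):
    """Largest finite value of a small float format with the given bit layout."""
    bias = 2 ** (exp_bits - 1) - 1
    top_exp_field = 2 ** exp_bits - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN, so the largest usable one is one lower.
        exponent = (top_exp_field - 1) - bias
        mantissa = 2 - 2 ** -man_bits
    else:
        # "FN" style: top exponent is usable; only the all-ones mantissa there is NaN.
        exponent = top_exp_field - bias
        mantissa = 2 - 2 ** -(man_bits - 1)
    return mantissa * 2 ** exponent


print("E4M3 max:", float_max(4, 3, ieee_like=False))
print("E5M2 max:", float_max(5, 2, ieee_like=True))
print("FP16 max:", float_max(5, 10, ieee_like=True))
```

E4M3 trades range for an extra mantissa bit, which is why it suits weights and activations, while E5M2 keeps FP16-like range for gradients.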
Advanced Quantization Methods
GPTQ
Post-Training · 4-bit / 8-bit
Generative Pre-trained Transformer Quantization. Uses second-order weight information (Hessians) to minimally perturb weights during quantization. Offline quantization — run once, load the result. Produces W4A16 models. Industry standard for 4-bit inference.
Quantization Time: Minutes to hours (offline)
Accuracy vs raw INT4: Significantly better
Calibration Data: ~128 samples needed
Inference Engine: vLLM, TGI, AutoGPTQ
Pros
  • High accuracy preservation
  • Mature tooling (AutoGPTQ)
  • Pre-quantized models on HF Hub
Cons
  • Slow offline quantization
  • Less efficient than AWQ at inference time
  • Slightly worse perplexity than AWQ
AutoGPTQ · TheBloke models · vLLM supported
AWQ
Activation-Aware Weight Quantization
Key insight: Not all weights are equal. Weights feeding high-activation channels cause disproportionate quantization error. AWQ identifies these "salient" channels from activation magnitudes and protects them by applying per-channel scaling before quantization, so every weight stays in the same low-bit format and no mixed-precision kernels are needed at inference time. The result is typically better accuracy than GPTQ at the same bit-width.
vs GPTQ: Better perplexity
Inference Speed: Faster (efficient kernels)
Calibration: Small unlabeled dataset
Hardware: NVIDIA GPU required
Rule: AWQ is the preferred 4-bit format for production GPU serving. Use AutoAWQ for quantization. Load with vLLM or TGI. Better quality AND faster than GPTQ in most benchmarks.
Preferred for vLLM · AutoAWQ · Production Recommended
GGUF
llama.cpp format · CPU/GPU
GGUF (GGML Unified Format) — the file format and quantization scheme used by llama.cpp. Supports a range of quantization levels: Q2_K, Q3_K_S/M/L, Q4_K_S/M, Q5_K_S/M, Q6_K, Q8_0. Uses k-quants with per-block scales and mixed precision within a block. CPU-first design.
CPU inference: ✅ First-class
GPU offload: Partial layers
Best format: Q4_K_M (accuracy/speed)
Apple Silicon: ✅ Metal acceleration
Q4_K_M is the sweet spot — best quality-to-size ratio for GGUF. Use Q5_K_M if you have extra RAM and want near-FP16 quality.
CPU Deployment · Mac M-series · llama.cpp / Ollama
bitsandbytes
8-bit / 4-bit · NF4
Tim Dettmers' library. 8-bit (LLM.int8()) uses vector-wise quantization with a mixed-precision decomposition that keeps outlier feature dimensions in FP16. 4-bit uses NF4 (NormalFloat4), an information-theoretically optimal 4-bit type for normally distributed weights. Primary use-case: QLoRA fine-tuning. For inference, AWQ/GPTQ have better kernel performance.
NF4 vs INT4: Better for normal dists
Primary Use: QLoRA training
Inference Speed: ⚠️ Slower than AWQ
Double Quant: Quantize scale factors too
Warning: bitsandbytes is not optimized for pure inference throughput. Use it for QLoRA training or quick experiments. In production serving, prefer AWQ or GPTQ quantized models.
🔄 QLoRA — Training vs Inference Distinction

QLoRA training: Base model loaded in 4-bit NF4 (frozen). LoRA adapters trained in BF16. This dramatically reduces training VRAM. The 4-bit base weights are dequantized on the fly for the forward and backward passes; gradients flow through them, but only the adapters are updated.

After QLoRA training: You have a base 4-bit model + LoRA adapter weights in BF16. For inference, you can: (a) serve base model + merge adapters at runtime, or (b) dequantize → merge → re-quantize with AWQ/GPTQ for production. Option (b) is better for serving.

Critical: QLoRA-trained models in raw bitsandbytes format are NOT efficient for serving. Always merge and re-export for production.
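The "merge" step in option (b) is plain matrix arithmetic: the adapter's low-rank update is folded into the dequantized base weight as W' = W + (alpha / r) · B·A, after which the adapter is no longer needed. A minimal sketch, assuming the base weight has already been dequantized to floating point; the helper names and toy matrices are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A for W:[out,in], B:[out,r], A:[r,in]."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # dequantized base weight, [out=2, in=3]
lora_a = [[1.0, 0.0, -1.0]]               # adapter A, [r=1, in=3]
lora_b = [[0.5], [2.0]]                   # adapter B, [out=2, r=1]
merged = merge_lora(w, lora_a, lora_b, alpha=2, r=1)

x = [[1.0], [2.0], [3.0]]                 # one input column vector
print(merged, matmul(merged, x))
```

Because the merged matrix gives the same output as base plus adapter, it can then be re-quantized with AWQ or GPTQ like any ordinary checkpoint.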
🔧 SmoothQuant & ZeroQuant

SmoothQuant

Tackles the outlier problem in activations: some activation channels are 100× larger than others, making them hard to quantize. SmoothQuant mathematically migrates the quantization difficulty from activations to weights (which are easier to quantize) by introducing a per-channel scale factor. Enables W8A8 quantization with minimal accuracy loss.

W8A8 enabler · TensorRT-LLM
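The difficulty migration in SmoothQuant can be sketched numerically. The example assumes the per-channel factor s_j = max|X_j|^α / max|W_j|^(1−α) with α = 0.5, applied as X' = X / s and W' = s · W so that X'W' = XW; the helper names and the toy matrices are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    n_in = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(n_in)]
    w_max = [max(abs(v) for v in w[j]) for j in range(n_in)]
    s = [(act_max[j] ** alpha) / (w_max[j] ** (1 - alpha)) for j in range(n_in)]
    x_s = [[row[j] / s[j] for j in range(n_in)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(n_in)]
    return x_s, w_s, s


x = [[0.5, 80.0, -0.3], [-0.4, -95.0, 0.6]]     # [tokens, in]; channel 1 is an outlier channel
w = [[0.2, -0.1], [0.05, 0.03], [-0.3, 0.4]]    # [in, out]
x_s, w_s, s = smooth(x, w)

y_ref = matmul(x, w)
y_smooth = matmul(x_s, w_s)
peak_before = max(abs(v) for row in x for v in row)
peak_after = max(abs(v) for row in x_s for v in row)
print(peak_before, peak_after)
```

The layer output is unchanged, but the activation peak shrinks substantially, which makes per-tensor INT8 activation quantization far less lossy.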

ZeroQuant

Microsoft's method for W8A8 at the operator level. Uses token-wise quantization for activations and group-wise quantization for weights. Integrated into DeepSpeed. Supports INT4/INT8 with hardware-aware quantization kernels. Extended by ZeroQuant-V2 and ZeroQuant-FP.

DeepSpeed · W8A8
§ 03
Decision Guide · Quantization Decision Framework

Hardware → Format Rules

CPU only → GGUF Q4_K_M via llama.cpp / Ollama
Mac M1/M2/M3 → GGUF Q4_K_M or Q5_K_M via llama.cpp (Metal)
RTX 3090 / 4090 (24GB) → AWQ 4-bit or FP16 via vLLM
A100 / A6000 (40-80GB) → FP16 / BF16 or AWQ via vLLM
H100 (80GB) → FP8 E4M3 via vLLM / TensorRT-LLM
Multi-GPU (2–8× A100) → FP16 + Tensor Parallel via vLLM (--tensor-parallel-size N)
Edge / Jetson / Mobile → INT4 or GGUF Q4 via ONNX Runtime / llama.cpp
Use-Case Rules
High throughput API (>1000 req/s): FP8 on H100s with vLLM. Enable continuous batching + prefix caching. This is non-negotiable for cost efficiency at scale.
Latency-critical (<50ms TTFT): FP16 or BF16 — quantization usually doesn't help latency for single requests, only throughput. Prefill optimization matters more.
Fitting model on consumer GPU: AWQ 4-bit. A 70B model fits on 2× RTX 3090. A 13B model fits on a single RTX 3080 10GB. A 7B model on RTX 3060 12GB.
Fine-tuned model deployment: Merge LoRA → full model → quantize with AWQ → serve with vLLM. Never serve raw bitsandbytes models at scale.
Accuracy is critical (medical/legal): FP16 minimum. If you must quantize, use INT8 with SmoothQuant calibration and always run task-specific evals before deploying.
VLM (vision-language model): Quantize text decoder aggressively (INT4/AWQ), but be conservative with vision encoder (FP16 or INT8 at minimum). Vision encoder activations are more sensitive.
Model Size × VRAM Matrix
Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM | Min GPU (INT4)
3B | 6 GB | 3 GB | 1.8 GB | RTX 3060 (8GB)
7B | 14 GB | 7 GB | 4 GB | RTX 3060 12GB
13B | 26 GB | 13 GB | 7 GB | RTX 3080 10GB
34B | 68 GB | 34 GB | 17 GB | 2× RTX 3090
70B | 140 GB | 70 GB | 35 GB | 2× A100 40GB / 2× RTX 3090
405B | 810 GB | 405 GB | 202 GB | 8× A100 80GB (INT4)
Note: Add ~20–30% for KV cache overhead at serving time. These are weights-only estimates.
§ 04
Reference · Hardware Compatibility Matrix
Hardware | Best Quant | FP8 / INT8 / INT4 / GGUF | Notes
H100 80GB SXM | FP8 E4M3 | | Native FP8 Tensor Cores. Best-in-class inference. Use TensorRT-LLM or vLLM.
A100 80GB | FP16 / AWQ | ⚠️ | No native FP8. BF16 Tensor Cores. Workhorse for production LLM serving.
RTX 4090 24GB | AWQ 4-bit | ⚠️ | Ada Lovelace. Software FP8 support. Great value for single-GPU inference.
RTX 3090 24GB | AWQ 4-bit / INT8 | | Ampere. BF16 supported. Popular for local 70B (2×) or 13B single GPU serving.
RTX 3080 10GB | AWQ 4-bit | ⚠️ | 10GB limits to 7B-13B INT4. Tight for INT8 on 13B.
T4 16GB (Cloud) | INT8 | | Turing. No BF16. Common in GCP/AWS for cheap inference. FP16 inference.
V100 16/32GB | INT8 | ⚠️ | Volta. No BF16, no INT4 Tensor Cores. Use bitsandbytes INT8 carefully.
Apple M2/M3 (Metal) | GGUF Q4_K_M | ⚠️ | Unified memory. llama.cpp with Metal. Outstanding for local dev / 7B-13B models.
CPU (x86) | GGUF Q4_K_M | ⚠️ ⚠️ | AVX2/AVX-512 for acceleration. llama.cpp. Slow but cost-effective for low traffic.
Jetson Orin (Edge) | INT4 / GGUF | | Ampere GPU + ARM. Use TensorRT for INT8. GGUF Q4 for flexibility.
§ 05
Implementation · Quantization Methods: Code & Usage
5.1 bitsandbytes — 4-bit & 8-bit Loading
Python · bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit NF4 (QLoRA style — for fine-tuning or quick experiments)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 for normally-distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16 during forward pass
    bnb_4bit_use_double_quant=True,      # Quantize scale factors too (~0.4 bits extra saving)
)

# 8-bit INT8 (better for inference accuracy)
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Outlier threshold: activation features above this magnitude are handled in FP16
)

model_id = "meta-llama/Meta-Llama-3-8B"  # Hub ID (gated repo; accept the license first)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config_4bit,
    device_map="auto",  # Automatically distributes across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
5.2 GPTQ — Offline Quantization
Python · AutoGPTQ
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration set: ~128 tokenized text samples, ideally from your target domain.
# `calibration_texts` is a placeholder for your own list of strings.
calibration_examples = [tokenizer(text) for text in calibration_texts]

# Step 1: Quantize (run once offline, save result)
quantize_config = BaseQuantizeConfig(
    bits=4,         # 4-bit quantization
    group_size=128,  # Smaller = better quality, slightly more VRAM (more scale factors)
    desc_act=False,  # False = faster inference; True = better accuracy
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(calibration_examples)
model.save_quantized("./llama3-8b-gptq-4bit")

# Step 2: Load the quantized model for inference
model = AutoGPTQForCausalLM.from_quantized(
    "./llama3-8b-gptq-4bit",
    device_map="auto",
    use_triton=True,  # Faster Triton kernels if available
)
5.3 AWQ — Activation-Aware Quantization
Python · AutoAWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Quantize offline
model_id = "meta-llama/Meta-Llama-3-8B"
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)  # uses AutoAWQ's default calibration set unless calib_data is passed
model.save_quantized("./llama3-8b-awq-4bit")
tokenizer.save_pretrained("./llama3-8b-awq-4bit")  # keep tokenizer files next to the weights

# Load in vLLM (preferred serving path)
# vllm serve ./llama3-8b-awq-4bit --quantization awq
5.4 GGUF — Convert & Run on CPU
Bash · llama.cpp conversion
# Step 1: Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j8  # or: cmake -B build && cmake --build build --config Release

# Step 2: Convert HF model to GGUF F16
python convert_hf_to_gguf.py /path/to/llama3-8b-hf --outtype f16 --outfile llama3-8b-f16.gguf

# Step 3: Quantize to Q4_K_M (recommended)
./llama-quantize llama3-8b-f16.gguf llama3-8b-q4km.gguf Q4_K_M

# Step 4: Run inference (CPU)
./llama-cli -m llama3-8b-q4km.gguf -p "Explain quantization in one paragraph:" -n 256

# With partial GPU offload (e.g. 20 layers to GPU)
./llama-cli -m llama3-8b-q4km.gguf -ngl 20 -p "Hello"

# Serve as OpenAI-compatible API
./llama-server -m llama3-8b-q4km.gguf --port 8080
5.5 FP8 with vLLM
Bash · vLLM FP8 (H100 required)
# Static FP8: quantize offline with llm-compressor (vLLM's companion quantization toolkit).
# It is driven from Python (its `oneshot` entry point with an FP8 recipe) and writes a
# checkpoint directory, e.g. ./llama3-70b-fp8, which vLLM can then load directly:
vllm serve ./llama3-70b-fp8 --tensor-parallel-size 4

# Or: dynamic FP8 (no calibration needed, slightly lower quality)
vllm serve meta-llama/Meta-Llama-3-70B \
  --dtype float16 \
  --quantization fp8 \
  --tensor-parallel-size 4
§ 06
Critical Section · Inference Engines
vLLM
UC Berkeley · Production GPU LLM Serving

The default choice for production GPU inference. Built around PagedAttention for near-zero KV cache waste, continuous batching, and a growing list of quantization support. OpenAI-compatible API out of the box. Python-native, easy to deploy.

Strengths
  • PagedAttention (near-zero memory waste)
  • Continuous batching
  • AWQ, GPTQ, FP8, INT8 support
  • Tensor parallelism built-in
  • OpenAI API compatible
Weaknesses
  • No CPU inference
  • NVIDIA GPU only (AMD experimental)
  • Less tunable than TensorRT
  • Not ideal for very small models
FP16 / BF16 · AWQ · GPTQ · FP8 · INT8
🚀
TensorRT-LLM
NVIDIA · Maximum GPU Throughput

NVIDIA's production-grade engine. Compiles models into TensorRT graphs with fused kernels, optimal memory layouts, and hardware-specific optimizations. Highest raw throughput on NVIDIA hardware, but complex to set up and inflexible (model format locked to TRT).

Strengths
  • Maximum throughput on NVIDIA
  • FP8, INT8, SmoothQuant native
  • Speculative decoding, LoRA
  • Production-tested by NVIDIA
Weaknesses
  • Complex build pipeline
  • NVIDIA-only, CUDA-locked
  • Slow model compilation step
  • Less flexible than vLLM
FP8 · INT8 · SmoothQuant · NVIDIA Only
🦙
llama.cpp
Georgi Gerganov · CPU-First Inference

The CPU inference champion. Pure C++ with AVX/AVX2/AVX-512 and Apple Metal support. GGUF format with k-quants (Q2_K through Q8_0). Supports partial GPU offload. Default choice for local/CPU/edge deployments. Powers Ollama, LM Studio, Jan.

Strengths
  • CPU and Apple Silicon native
  • Partial GPU offload
  • OpenAI API via llama-server
  • Wide model support
Weaknesses
  • Not for high-throughput GPU serving
  • GGUF format only
  • Continuous batching in llama-server is basic compared with vLLM / TGI
🤗
HuggingFace Transformers
Experimentation & Fine-tuning

The research and experimentation standard. Not optimized for production throughput but unmatched in flexibility and model coverage. Use for: fine-tuning, custom architectures, research, quick testing. Integrates bitsandbytes, GPTQ, and AWQ via quantization configs. Not for production serving at scale.

Production warning: HF Transformers has no continuous batching, poor memory management, and naive sequential batching. At scale (>10 req/s), switch to vLLM or TGI.
🏭
TGI (Text Generation Inference)
HuggingFace · Production Serving

HuggingFace's production inference server. Continuous batching, tensor parallelism, GPTQ/AWQ support. Slightly less throughput than vLLM on benchmarks but very well-integrated with HF Hub models. Good choice if your stack is HF-centric.

FP16 · GPTQ · AWQ · HF Hub native
🔷
ONNX Runtime
Microsoft · Cross-Platform Edge/Cloud

Export-compile-deploy pipeline. Best for: cross-platform deployment (Windows, iOS, Android, WebAssembly), small models, edge devices. Supports INT8 quantization and DirectML for non-NVIDIA GPUs. Not ideal for large autoregressive LLMs — optimized for encoder models and smaller decoders.

Edge Devices · INT8 · DirectML
🌊
DeepSpeed Inference
Microsoft · Multi-GPU Large Models

Microsoft's inference optimization library. Kernel injection for fused ops, ZeroQuant INT4/INT8 quantization, and tensor parallelism. Best for very large models (>100B) on multi-GPU clusters where model doesn't fit in standard configurations. Also used for DeepSpeed-Chat serving pipeline.

100B+ Models · ZeroQuant · Multi-GPU
§ 07
Reference Matrix · Engine × Quantization Compatibility
EngineFP32FP16/BF16INT8INT4FP8GPTQAWQGGUFNF4 (bnb)
vLLM⚠️
TensorRT-LLM
llama.cpp⚠️⚠️
HF Transformers⚠️
TGI⚠️
ONNX Runtime⚠️
DeepSpeed⚠️

Native support ⚠️ Partial / experimental Not supported

§ 08
Critical Differences · LLM vs VLM Quantization
Why VLM Quantization is Harder

Vision-Language Models (LLaVA, Qwen-VL, InternVL, Idefics, etc.) consist of at minimum two components: a vision encoder (ViT-based) and a text decoder (transformer LLM). These have fundamentally different quantization characteristics.

🖼️ Vision Encoder
  • Processes raw pixel values → patch embeddings
  • Activations have high variance and extreme outliers
  • Many operations are not matmul-dominated (Conv2d, LayerNorm)
  • INT4 causes severe visual distortion artifacts
  • FP16 or INT8 with careful calibration is the minimum
  • CLIP encoders particularly sensitive
📝 Text Decoder
  • Standard autoregressive transformer
  • Can be aggressively quantized (INT4/AWQ)
  • Larger proportion of model VRAM
  • Standard LLM quantization techniques apply
  • AWQ or GPTQ 4-bit works well here
Critical rule: Never apply INT4 to vision encoders in production. The quantization error propagates into multimodal embeddings and causes hallucinations that are hard to detect — the model may confidently describe a completely different image.
Recommended VLM Quantization Strategy
Vision Encoder: Keep FP16 or INT8 with SmoothQuant
Cross-Attention / MLP Projector: INT8 W8A16 (conservative)
Text Decoder LLM: AWQ 4-bit or INT8
Mixed precision total: ~60% memory savings vs full FP16
Common VLM Quantization Failure Modes
Failure Mode | Root Cause | Fix
Wrong object identification | Vision encoder INT4 causes embedding distortion | Upgrade vision encoder to FP16
Color/count errors | Patch embedding quantization error | Use INT8 for vision encoder, calibrate
Coherent text but wrong image | Projector layer quantized aggressively | Keep projector in FP16
Increased hallucination rate | Multimodal embedding space corrupted | Run visual QA benchmarks (VQAv2) post-quant
Slow inference despite quantization | Mixed precision dequantization overhead | Profile with nsight; batch vision pre-processing
§ 09
End-to-End · Deployment Pipelines
Pipeline A: GPTQ → vLLM Production Serving
1. Prepare Calibration Dataset
   Select ~128 representative text samples from your use case. Domain-specific data gives better calibration than generic data.
2. Run GPTQ Quantization
   Run the AutoGPTQ quantization script from § 5.2 with bits=4 and group_size=128 on llama3-70b, saving to ./llama3-70b-gptq4b. Takes 1–4 hours on A100.
3. Validate Perplexity
   Measure WikiText-2 perplexity of quantized vs original. Accept if delta <0.5 for most models. Run task-specific evals (HellaSwag, MMLU) for accuracy-critical apps.
4. Serve with vLLM
   vllm serve ./llama3-70b-gptq4b --quantization gptq --tensor-parallel-size 2 --port 8000
5. Load Test
   Use locust or wrk to verify throughput under expected load. Monitor GPU memory and token/sec.
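The perplexity gate in step 3 can be sketched as below. This is a minimal illustration assuming you already have per-token natural-log probabilities from an evaluation run of each model; the function names and the sample numbers are hypothetical, and the 0.5 threshold is the one quoted in step 3.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def accept_quantized(ppl_baseline, ppl_quantized, max_delta=0.5):
    """Pass the quantized model only if perplexity rose by less than max_delta."""
    return (ppl_quantized - ppl_baseline) < max_delta


# Sanity check: a model assigning probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)
print(uniform_ppl)
print(accept_quantized(5.5, 5.8), accept_quantized(5.5, 6.2))
```

Perplexity is only the first gate; the task-specific evals mentioned in step 3 still apply.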
Pipeline B: GGUF for Local CPU/Mac Deployment
1. Download HF Model
   huggingface-cli download meta-llama/Meta-Llama-3-8B --local-dir ./llama3-8b
2. Convert to GGUF FP16
   python llama.cpp/convert_hf_to_gguf.py ./llama3-8b --outtype f16 --outfile llama3-8b-f16.gguf
3. Quantize to Q4_K_M
   ./llama-quantize llama3-8b-f16.gguf llama3-8b-q4km.gguf Q4_K_M. ~4GB output for 8B model.
4. Serve with llama-server
   ./llama-server -m llama3-8b-q4km.gguf -c 4096 --port 8080 -ngl 99 (ngl=layers on GPU, 99=all)
Pipeline C: INT8 + TensorRT-LLM on H100
1. Calibrate with SmoothQuant
   Run SmoothQuant calibration to migrate outliers from activations to weights. Requires 512 calibration samples.
2. Convert Checkpoint
   python convert_checkpoint.py --model_dir llama3-70b --dtype float16 --use_smooth_quant --int8_kv_cache --output_dir ./trt_ckpts
3. Compile with trtllm-build
   trtllm-build --checkpoint_dir ./trt_ckpts --output_dir ./trt_engines --gemm_plugin float16. This step takes 20–60 min.
4. Deploy via Triton Server
   Serve TRT engine via NVIDIA Triton Inference Server with TensorRT-LLM backend for production scale.
§ 10
Performance · Optimization Techniques
📦 KV Cache Optimization

KV cache = the biggest VRAM consumer after weights. For 70B, FP16 KV cache can exceed model weights at long contexts.

  • PagedAttention (vLLM): Non-contiguous KV blocks, near-zero waste, enables sharing for parallel sampling
  • 8-bit KV cache: --kv-cache-dtype fp8 (or fp8_e5m2) in vLLM. 50% KV cache memory reduction, <0.5% quality loss
  • KV cache quantization: per-token scale factors for activations; more accurate than per-tensor
  • Context length management: Set --max-model-len conservatively. KV cache memory grows linearly with context length and batch size (it is attention compute that grows quadratically)
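A rough sketch of how KV cache size can be estimated from the model shape. The formula counts one key and one value vector per layer, per KV head, per token; the example numbers assume a Llama-3-70B-style configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128), and the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed for keys + values across all layers (factor 2 = K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


GIB = 1024 ** 3

# Assumed 70B-class shape: 80 layers, 8 KV heads, head_dim 128.
fp16_8k = kv_cache_bytes(80, 8, 128, seq_len=8192)
fp16_16k = kv_cache_bytes(80, 8, 128, seq_len=16384)
fp8_8k = kv_cache_bytes(80, 8, 128, seq_len=8192, bytes_per_elem=1)

print(f"FP16,  8k context: {fp16_8k / GIB:.2f} GiB per sequence")
print(f"FP16, 16k context: {fp16_16k / GIB:.2f} GiB per sequence")
print(f"8-bit, 8k context: {fp8_8k / GIB:.2f} GiB per sequence")
```

Multiply by the number of concurrent sequences to see why the cache, not the weights, often sets the batch-size ceiling.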
🔄 Continuous Batching

The key throughput unlock. Unlike static batching (wait for all requests, process together), continuous batching slots in new requests as soon as GPU compute is free.

  • Eliminates idle GPU time between requests
  • Supported in vLLM, TGI, TensorRT-LLM natively
  • 3–10× throughput improvement over naive batching
  • Critical for mixed-length workloads
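The gain on mixed-length workloads can be seen in a toy simulation. This sketch counts decode steps only and assumes every step costs the same regardless of how many slots are occupied (prefill and scheduling overhead are ignored); the function names and the request lengths are hypothetical.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled from the queue before the next decode step."""
    pending = list(lengths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1
        # Requests with one token left finish on this step and leave the batch.
        active = [n - 1 for n in active if n > 1]
    return steps


# Output lengths in tokens: two long requests mixed with six short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In the static case the short requests sit finished while the batch waits for its longest member; with continuous batching their slots are reused immediately.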
⏱️ Prefill vs Decode Optimization

Two distinct phases with different bottlenecks:

  • Prefill: compute-bound (processes entire prompt in one pass). Optimize with chunked prefill to overlap with decode.
  • Decode: memory-bandwidth bound (one token at a time). Optimize with batching (more sequences = better GPU utilization) and quantization.
  • Chunked prefill (vLLM v1): Break long prompts into chunks, interleave with decode steps to avoid decode stalls.
🎯 Speculative Decoding

Use a small draft model (e.g. 1B) to speculatively generate 3–5 tokens, then verify with the target model in a single forward pass. Effective for latency-bound workloads.

  • Reduces decode latency 1.5–2.5× for appropriate tasks
  • No accuracy loss (mathematically equivalent)
  • Supported in vLLM, TensorRT-LLM
  • Works best for predictable output distributions
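The draft-then-verify loop can be sketched with toy models. This is a simplified greedy variant: the draft proposes up to k tokens, the target keeps the longest prefix it agrees with and supplies its own token at the first mismatch. For clarity the sketch calls the target once per position, whereas a real engine scores all proposed positions in a single batched forward pass; the toy "models" and all names are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function over a token prefix


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Reference: plain one-token-at-a-time decoding with the target model."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq


def speculative_greedy(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    seq = list(prompt)
    goal = len(prompt) + max_new
    while len(seq) < goal:
        # 1) Draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) Target verifies each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: take the target's token, stop
                break
        else:
            # Every proposal accepted: the target contributes one extra token if room remains.
            if len(seq) + len(accepted) < goal:
                accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq


def toy_target(seq: List[int]) -> int:
    return (sum(seq) * 31 + 7) % 50


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 3.
    return toy_target(seq) if len(seq) % 3 else (toy_target(seq) + 1) % 50


prompt = [3, 14, 15]
print(speculative_greedy(toy_target, toy_draft, prompt, max_new=12))
```

Every emitted token is one the target would have chosen itself, which is why the output matches plain decoding; the speed-up depends entirely on how often the draft agrees.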
Prefix Caching (APC)

Store and reuse KV cache for repeated prompt prefixes (system prompts, RAG context). In vLLM, Automatic Prefix Caching (APC) is enabled by default in v1. Eliminates redundant prefill computation for requests sharing a long common prefix. For chatbots with a fixed system prompt, this can reduce effective compute by 40–60%.

Production rule: Always use APC in vLLM. Structure your prompts to front-load the shared prefix (system prompt + RAG docs) before the user message. Ensure prompt stability — even small changes break cache hits.
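Why small prompt changes break cache hits becomes clear once the prompt is viewed as a chain of fixed-size blocks, each identified by a hash of itself plus everything before it. The sketch below is loosely modeled on block-level prefix hashing and is not vLLM's actual implementation; the block size, function names, and token IDs are hypothetical.

```python
import hashlib


def block_hashes(token_ids, block_size=16):
    """Hash each full block together with the hash of all preceding blocks."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest
    return hashes


def cached_prefix_tokens(cache, token_ids, block_size=16):
    """Number of leading tokens whose KV blocks could be reused from the cache."""
    hits = 0
    for digest in block_hashes(token_ids, block_size):
        if digest not in cache:
            break
        hits += 1
    return hits * block_size


system_prompt = list(range(64))                       # shared 64-token prefix
first = system_prompt + [1000 + i for i in range(20)]
second = system_prompt + [2000 + i for i in range(20)]
shifted = [999] + second                              # dynamic content placed at the front

cache = set(block_hashes(first))
print(cached_prefix_tokens(cache, second), cached_prefix_tokens(cache, shifted))
```

A single token inserted ahead of the shared prefix changes the first block's hash and therefore every hash after it, so nothing is reused.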
§ 11
Troubleshooting · Debugging & Failure Modes
Issue | Likely Cause | Diagnosis | Fix
Significant accuracy drop after INT4 | Model too small, or high-variance activations | Compare perplexity before/after. Run task evals. | Switch to INT8 or AWQ with smaller group size (g=64)
OOM during quantization | Full model loaded in FP16 before quantization | nvidia-smi, torch memory profiler | Quantize layer-by-layer, use CPU offload during GPTQ
OOM during serving | KV cache overflows, batch too large | Check vLLM logs for KV cache usage | Reduce --max-model-len, add --gpu-memory-utilization 0.85, use an 8-bit (FP8) KV cache
Slow inference despite GPU | Memory bandwidth saturation, small batch size | Profile with nsight; check tokens/sec per GPU | Increase batch size, enable tensor parallelism, check for CPU-GPU transfer bottlenecks
Repetitive or incoherent output | Quantization degraded attention mechanisms | Compare outputs vs FP16 on same prompts | Increase precision (INT8→FP16), check if attention layers are quantized
GGUF slower than expected on CPU | Missing AVX2/AVX-512, wrong thread count | ./llama-cli --help; check BLAS backend | Compile with GGML_AVX2=ON, set -t to physical cores
VLM seeing wrong image content | Vision encoder over-quantized | Test with known images, compare to FP16 baseline | Keep vision encoder in FP16, only quantize text decoder
vLLM prefix cache miss rate high | Prompt instability, timestamp/dynamic content in prefix | Check vLLM metrics endpoint for cache hit rate | Move dynamic content to end of prompt, stabilize system prompt
TensorRT build fails | CUDA version mismatch, unsupported op | TRT build logs, check CUDA compute capability | Match TRT version to CUDA driver, verify model is in supported list
§ 12
Applied Examples · Real-World Scenarios
🖥️ Low-Cost Chatbot on CPU CPU / Edge
Model: Llama-3-8B or Mistral-7B
Quantization: GGUF Q4_K_M
Engine: llama.cpp / Ollama

Deploy on a $5/month VPS (4 cores, 8GB RAM) or even a local Mac Mini. ~2–5 tokens/sec on CPU. Suitable for <5 concurrent users. Ollama gives a one-command server. Total infra cost: near zero.

~3.5GB RAM · No GPU · $0 GPU cost
⚡ High-Throughput API (1000+ req/s) Production Scale
Model: Llama-3-70B or Mixtral 8×7B
Quantization: FP8 (H100) or AWQ 4-bit (A100)
Engine: vLLM + tensor parallelism

8× H100 cluster with vLLM in FP8. Continuous batching + prefix caching. FP8 KV cache. Target: 5,000 tokens/sec/GPU. Use a load balancer across multiple vLLM instances. Horizontal scaling with Kubernetes. With FP8 on H100: ~40% cost reduction vs FP16 on A100s.

Bash
vllm serve meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 8 \
  --dtype float16 \
  --quantization fp8 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
📷 Multi-Camera VLM System VLM Production
Model: InternVL-2-26B or LLaVA-1.6
Quantization: Vision: FP16, Text: AWQ 4-bit
Engine: vLLM (VLM support) or custom

Smart factory use case: 8 cameras feed images to VLM for defect detection. Batch images from multiple cameras into a single forward pass. Keep vision encoder in FP16 — critical for defect pixel-level accuracy. Use text decoder in AWQ 4-bit to fit on 2× RTX 3090. Pre-process images asynchronously to hide latency.

Key insight: For manufacturing/defect applications, never quantize the vision encoder below INT8. False negatives in defect detection are critical failures.
🔧 Edge Deployment (Industrial IoT) Edge
Hardware: Jetson Orin or Raspberry Pi 5
Model: Qwen2.5-1.5B or Phi-3-mini
Quantization: INT4 (TensorRT on Jetson) / GGUF Q4

On-device inference with no cloud dependency. Jetson Orin: use TensorRT INT8 for maximum throughput (supports 7B models at ~10 tokens/sec). Raspberry Pi 5: GGUF Q4 on CPU via llama.cpp — 1–3B models only at ~1–3 tokens/sec. Functional for offline industrial QA assistants.

§ 13
High Value · Practical Tips
🎯
AWQ > GPTQ for new deployments. Unless you have a specific reason (e.g., pre-quantized GPTQ model already available), AWQ produces better perplexity and has faster inference kernels. The gap is ~0.3–0.8 perplexity points in favor of AWQ at 4-bit.
💡
Calibration data matters more than you think. GPTQ/AWQ calibration data should match your use case. Medical LLM? Calibrate on medical text. Code assistant? Use code samples. Domain mismatch in calibration can cost 1–2% task accuracy.
⚠️
Never benchmark tokens/sec on a single request. That measures decode speed, not serving throughput. Use a realistic concurrency level (10–100 simultaneous requests). vLLM's throughput benchmark: python benchmarks/benchmark_throughput.py
🔢
An 8-bit KV cache is almost always free money. Enable --kv-cache-dtype fp8 in vLLM. 50% KV cache memory reduction, <0.3% perplexity increase. This lets you serve longer contexts or more concurrent requests without touching model weights.
📊
For 70B+ models: don't try to squeeze onto fewer GPUs than needed. Running a 70B AWQ model on 2× RTX 3090s barely works but PCIe bandwidth becomes the bottleneck. 2× A100 40GB is far more efficient despite similar raw VRAM.
🚀
Speculative decoding with a 3B draft + 70B target is often the best setup for latency-constrained serving. Draft with a fast 3B model, verify with the 70B. Output quality matches the 70B model, since every token is verified by it, while decode latency drops well below plain 70B decoding for many query types.
🔄
Prefix caching is your cheapest optimization. For RAG systems with large context documents, APC eliminates re-prefilling the same document chunks repeatedly. Can improve effective throughput by 2–4× in RAG workloads. Structure: [system prompt][retrieved docs][user query].
💰
The 10× cost reduction formula: FP16 on A100 → AWQ 4-bit on 2× RTX 4090. Near-identical model quality (expect the usual 1–2% 4-bit loss), same (or better) throughput, ~10× cheaper GPU rental cost. This is the most practical cost lever available today.
🧪
Always run task-specific evals, not just perplexity. A model can have good perplexity but poor accuracy on your specific task after quantization. For classification tasks, quantization error tends to affect boundary cases first.
🏗️
For fine-tuned models: train QLoRA → merge → re-quantize with AWQ. Never serve the raw QLoRA model with bitsandbytes. The merged+AWQ path gives 2–3× better throughput. The extra conversion step is worth it for any production deployment.
§ 14
Reference Code · Code Snippets
vLLM — Production Inference
Python · vLLM
from vllm import LLM, SamplingParams

# Load AWQ 4-bit model
llm = LLM(
    model="./llama3-8b-awq-4bit",
    quantization="awq",
    tensor_parallel_size=1,      # Number of GPUs
    gpu_memory_utilization=0.90,   # Reserve 10% for safety
    enable_prefix_caching=True,   # APC - huge win for RAG
    kv_cache_dtype="fp8_e5m2",    # 8-bit (FP8 E5M2) KV cache
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# Batched inference (always batch for throughput)
prompts = ["Tell me about LLM quantization", "What is AWQ?"]
outputs = llm.generate(prompts, params)

for output in outputs:
    print(output.outputs[0].text)
vLLM OpenAI-Compatible API Server
Bash · vLLM Server
# Start server (AWQ model, 2 GPUs, prefix caching enabled)
vllm serve ./llama3-70b-awq \
  --quantization awq \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 16384 \
  --host 0.0.0.0 \
  --port 8000

# Query with standard OpenAI client
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-70b-awq", "messages": [{"role": "user", "content": "Hello"}]}'
Check Model Memory Requirements Before Loading
Python · Memory Estimator
def estimate_vram(param_billions, dtype_bytes=2, kv_overhead=1.3):
    """
    param_billions: model size (e.g. 7 for 7B)
    dtype_bytes: 2=FP16, 1=INT8, 0.5=INT4
    kv_overhead: 1.2-1.4x for KV cache + activations
    """
    weights_gb = param_billions * dtype_bytes  # 1e9 params x bytes each = decimal GB, matching the tables above
    return weights_gb * kv_overhead

# Examples
print(f"70B FP16: {estimate_vram(70, 2):.1f} GB")    # ~182 GB
print(f"70B INT8: {estimate_vram(70, 1):.1f} GB")    # ~91 GB
print(f"70B INT4: {estimate_vram(70, 0.5):.1f} GB")  # ~45 GB
print(f"8B INT4:  {estimate_vram(8, 0.5):.1f} GB")   # ~5.2 GB
§ 15
Final Reference · Comparison Summary
Quantization Methods — Final Comparison
Method | Bits | Accuracy | Inference Speed | CPU | Primary Use | Tooling
FP16/BF16 | 16 | 🟢 Baseline | 🟡 Baseline | | Default GPU serving | All frameworks
FP8 E4M3 | 8 | 🟢 ~FP16 | 🟢 ~2× FP16 | | H100 production | vLLM, TensorRT
INT8 (W8A8) | 8 | 🟢 <1% loss | 🟢 1.5–2× | ⚠️ | Bandwidth-limited | bitsandbytes, TRT
AWQ 4-bit | 4 | 🟡 1–2% loss | 🟢 3–4× | | Production GPU 4-bit | AutoAWQ, vLLM
GPTQ 4-bit | 4 | 🟡 1–2.5% loss | 🟡 2–3× | | 4-bit GPU inference | AutoGPTQ, TGI
GGUF Q4_K_M | ~4.5 | 🟡 ~1.5% loss | 🟡 CPU-optimized | | CPU/local deployment | llama.cpp, Ollama
NF4 (bnb) | 4 | 🟡 1–2% loss | 🔴 Slow kernels | | QLoRA training only | bitsandbytes
Inference Engine — Final Comparison
Engine | Best For | Throughput | Ease of Use | GPU Required | CPU
vLLM | Production GPU serving | 🟢 Excellent | 🟢 Easy | |
TensorRT-LLM | Max throughput NVIDIA | 🟢 Best | 🔴 Complex | |
llama.cpp | CPU / local / edge | 🟡 CPU-limited | 🟢 Easy | Optional |
HF Transformers | Research / fine-tuning | 🔴 Poor | 🟢 Best | Optional |
TGI | HF-integrated production | 🟡 Good | 🟢 Good | |
ONNX Runtime | Cross-platform / edge | 🟡 Medium | 🟡 Medium | Optional |
DeepSpeed | 100B+ models | 🟡 Good | 🔴 Complex | |
Recommended Stack Per Use Case
Use Case | Model Size | Quantization | Engine | Hardware
Local developer assistant | 7–13B | GGUF Q4_K_M | Ollama | Mac M2+ / Any CPU
Team internal API | 7–34B | AWQ 4-bit | vLLM | 1–2× RTX 3090/4090
Public API (<100 req/s) | 7–70B | AWQ 4-bit | vLLM | 2–4× A100 40GB
High-scale API (>1000 req/s) | 70B | FP8 | vLLM / TensorRT | 8× H100
Fine-tuned model (QLoRA) | 7–13B | QLoRA train → AWQ serve | vLLM | 1× RTX 3090+ train, A100 serve
RAG application | 7–34B | AWQ 4-bit + APC | vLLM (prefix cache ON) | 1–2× A100
VLM production | 7–26B | Vision: FP16, Text: AWQ | vLLM (VLM) / custom | 2–4× A100
Edge / IoT | 1–7B | INT4 / GGUF Q4 | TensorRT / llama.cpp | Jetson Orin / ARM
Experimentation | Any | bitsandbytes 4-bit | HF Transformers | Any GPU
🎯 The Production Decision in Two Steps
Have GPU?
  YES → H100?
    YES → FP8 + vLLM
    NO → AWQ 4-bit + vLLM
  NO → GGUF Q4_K_M + llama.cpp

In 90% of cases, this two-step decision covers the optimal choice. Layer on refinements (speculative decoding, KV cache tuning, SmoothQuant) only after validating this baseline works for your requirements.
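The two-step decision above is small enough to encode directly, for example as a default in a deployment script. A minimal sketch; the function name and signature are hypothetical, and the returned strings are the baseline stacks named in this section.

```python
def choose_stack(has_gpu: bool, is_h100: bool = False) -> str:
    """Baseline quantization + engine choice from the two-step decision."""
    if not has_gpu:
        return "GGUF Q4_K_M + llama.cpp"
    return "FP8 + vLLM" if is_h100 else "AWQ 4-bit + vLLM"


for has_gpu, is_h100 in [(True, True), (True, False), (False, False)]:
    print(has_gpu, is_h100, "->", choose_stack(has_gpu, is_h100))
```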