LLM/VLM Quantization & Inference Engineering Handbook

§ 01

Core Concepts · Foundations of Quantization

What is Quantization?

Quantization is the process of reducing the numerical precision of model weights and/or activations from high-precision floating point (FP32/FP16) to lower-precision formats (INT8, INT4, FP8). The goal is to reduce model memory footprint and accelerate matrix multiply operations — the dominant operation in transformer inference — at the cost of some representational accuracy.

Fundamentally: instead of storing a weight as a 32-bit float, you store it as an 8-bit or 4-bit integer, then scale it back during computation using a stored scale factor.
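The round trip described above can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor INT8 quantization, not any particular library's implementation; the function names and sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # one stored scale for the whole tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats: the 'scale it back' step."""
    return [v * scale for v in q]


weights = [0.42, -1.30, 0.07, 0.95, -0.58]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", q)
print("scale:", scale)
print("max round-trip error:", max_err)
```

The round-trip error per weight is bounded by half the scale, which is why a single large outlier (which inflates the scale) hurts every other weight in the tensor.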

Why Quantization is Required

💾 Memory Wall

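A 70B FP16 model needs ~140 GB VRAM for weights alone. With INT4 that drops to ~35 GB, which fits on 2× RTX 3090 (48 GB total) or a single 80 GB A100 with room left for KV cache; a single 40 GB A100 holds the weights but leaves little for anything else. Without quantization, large models are simply unrunnable on most hardware.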

⚡ Throughput Bottleneck

LLM inference is memory-bandwidth bound, not compute bound. Smaller dtypes = less data moved per token = higher throughput. INT8 can be 2× faster than FP16 on bandwidth-saturated hardware.
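A back-of-the-envelope version of the bandwidth argument: if every decoded token has to stream all weights from memory once, the single-stream decode rate is capped at bandwidth divided by weight bytes. The sketch below assumes a hypothetical 1000 GB/s device and a 7B model; it ignores KV cache reads, kernel overheads, and batching, so treat the numbers as ceilings rather than predictions.

```python
def max_decode_tokens_per_sec(param_billions, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    weights_gb = param_billions * bytes_per_param
    return bandwidth_gb_s / weights_gb


BANDWIDTH_GB_S = 1000.0  # hypothetical memory bandwidth, not a specific GPU
MODEL_B = 7              # 7B parameters, chosen for illustration

ceilings = {
    name: max_decode_tokens_per_sec(MODEL_B, bytes_per_param, BANDWIDTH_GB_S)
    for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]
}

for name, tps in ceilings.items():
    print(f"{name}: at most {tps:.0f} tokens/sec per sequence")
```

Halving the bytes per weight doubles the ceiling, which is where the "2× faster" figure for INT8 on bandwidth-saturated hardware comes from.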

💰 Cost Reduction

Smaller models fit on cheaper GPUs. A model that required an A100 in FP16 might run on an RTX 4090 in INT4. Cost per token can drop 5–10×. Critical for API-serving businesses.

Core Trade-off Triangle

Lower Precision →	Memory	Latency / Throughput	Accuracy
FP32	🔴 4× baseline	🔴 Slowest	🟢 Full accuracy
FP16 / BF16	🟡 2× baseline	🟡 ~2× faster	🟢 Negligible loss
INT8	🟢 1× baseline	🟢 2–3× faster	🟡 <1% degradation
INT4	🟢 0.5× baseline	🟢 3–4× faster	🟡 1–3% degradation

Quantization vs Pruning vs Distillation

Technique	What it does	Accuracy Cost	Best For
Quantization	Reduces precision of weights/activations	Low (0.5–3%)	Inference speed, VRAM reduction
Pruning	Removes unimportant weights (sets to 0)	Medium (can be high)	Structured sparsity, specialized hardware
Distillation	Trains smaller model to mimic larger one	Depends on size gap	Permanent smaller model, edge deployment
Combined	QAT + pruning + distillation	Controllable	Maximum compression with accuracy budget

Rule: Quantization is almost always the first optimization to apply. Post-training quantization needs no retraining, leaves the original checkpoint untouched (so you can always fall back to it), and has well-established tooling. Pruning and distillation require retraining.

§ 02

Reference · Types of Quantization

Core Floating Point Formats

FP16

16-bit · Half Precision

IEEE 754 half-precision. 1 sign bit, 5 exponent bits, 10 mantissa bits. Dynamic range: ~6×10⁻⁵ to 65504. The default "quantized" format for most GPU inference today.

Memory vs FP32

2× reduction

Hardware

All modern GPUs

Accuracy Loss

Negligible

CUDA Tensor Cores

✅ Fully utilized

Pros

No accuracy loss vs FP32
Native GPU support everywhere
Default for HF Transformers

Cons

Can overflow (max 65504)
Less stable for training
Half of FP32 memory is still large for big models

Production Default vLLM Native All NVIDIA GPUs

BF16

16-bit · Brain Float

Google Brain's format. Same 1 sign bit, but 8 exponent bits (same as FP32!) and only 7 mantissa bits. Same dynamic range as FP32 but less precision. Critical: no gradient overflow during training.

Dynamic Range

Same as FP32

Hardware

A100, H100, RTX 3090+

vs FP16

More stable training

Best For

Fine-tuning + inference

Rule: Prefer BF16 over FP16 for fine-tuning (no overflow) and for inference on Ampere+ hardware (A100, H100, RTX 3090+). Use FP16 on older Volta/Turing GPUs (V100, T4), which lack BF16 support.
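The range-versus-precision trade between the two 16-bit formats can be checked with the standard library alone. This sketch uses `struct`'s half-precision format for FP16 and emulates BF16 by keeping the top 16 bits of a float32 (simple truncation; real hardware usually rounds). The helper names are illustrative.

```python
import struct


def to_fp16(x):
    """Round-trip through IEEE half precision; values beyond 65504 do not fit."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fp16_overflows(x):
    """True if x cannot be packed as a finite FP16 value."""
    try:
        to_fp16(x)
        return False
    except (OverflowError, struct.error):
        return True


print("FP16 max round-trips:", to_fp16(65504.0))
print("70000 overflows FP16:", fp16_overflows(70000.0))
print("70000 in BF16:", to_bf16(70000.0))        # representable, but coarse
print("1.001 in FP16:", to_fp16(1.001))          # 10 mantissa bits
print("1.001 in BF16:", to_bf16(1.001))          # 7 mantissa bits
```

BF16 keeps 70000 in range but snaps it to a coarse grid, while FP16 resolves 1.001 more finely but cannot hold 70000 at all.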

INT8

8-bit · Integer

8-bit integer quantization. Weights and/or activations are mapped to [-128, 127] using scale factors. Two flavors: W8A8 (weights + activations both INT8, needs calibration) and W8A16 (only weights INT8, activations stay FP16 — simpler, more common).

Memory vs FP16

2× reduction

Throughput Gain

1.5–2.5× faster

Accuracy Loss

<1% (W8A16)

Calibration

Needed for W8A8

Pros

Strong accuracy preservation
Widely supported
Excellent for production APIs

Cons

W8A8 needs calibration data
Some activations hard to quantize
Outlier problem (see SmoothQuant)

INT4 / 4-bit

4-bit · Integer

Maps weights to 16 discrete values [-8, 7]. Groups of 64–128 weights share a scale factor (group quantization). Almost always W4A16 (activations stay FP16). The dominant format for consumer GPU inference today.

Memory vs FP16

4× reduction

Accuracy Loss

1–3% (model-dependent)

Group Size

g=128 typical

Sweet Spot

7B–70B models

Warning: Small models (<3B) degrade significantly with INT4. Larger models (70B+) are surprisingly robust. Always benchmark on your task before deploying INT4 in production.

Consumer GPU Sweet Spot AWQ / GPTQ / GGUF
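Group quantization, as described for INT4 above, can be sketched as follows: each group of weights gets its own scale, so an outlier only degrades its own group. This is a simplified symmetric scheme with made-up weights and a tiny group size for readability; production formats (AWQ, GPTQ, k-quants) add zero points, packing, and smarter rounding.

```python
def quantize_groups(weights, group_size=4, bits=4):
    """Symmetric per-group quantization; returns a list of (codes, scale) pairs."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes = [max(qmin, min(qmax, round(w / scale))) for w in group]
        groups.append((codes, scale))
    return groups


def dequantize_groups(groups):
    return [code * scale for codes, scale in groups for code in codes]


def bits_per_weight(bits=4, group_size=128, scale_bits=16):
    """Effective storage cost once the per-group FP16 scale is included."""
    return bits + scale_bits / group_size


def max_error_small_weights(weights, group_size):
    """Worst reconstruction error over the first four (small) weights."""
    restored = dequantize_groups(quantize_groups(weights, group_size))
    return max(abs(w - r) for w, r in zip(weights[:4], restored[:4]))


# Four small weights followed by four large ones (illustrative values)
weights = [0.01, -0.02, 0.03, 0.015, 2.0, -1.5, 0.8, 1.1]
err_one_group = max_error_small_weights(weights, group_size=8)
err_two_groups = max_error_small_weights(weights, group_size=4)

print("error with one shared scale:", err_one_group)
print("error with per-group scales:", err_two_groups)
print("bits/weight at g=128:", bits_per_weight(4, 128))
print("bits/weight at g=64: ", bits_per_weight(4, 64))
```

With a single shared scale the small weights all round to zero; with their own group scale they survive. The cost is the extra scale storage, which is why smaller groups mean slightly more VRAM.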

FP8

8-bit · Float (E4M3 / E5M2)

Two FP8 variants: E4M3 (4 exponent, 3 mantissa — better for weights/activations) and E5M2 (5 exponent, 2 mantissa — better for gradients). Requires H100 or Ada Lovelace (RTX 4090) hardware. Native CUDA support in H100 Transformer Engine.

Memory vs FP16

2× reduction

vs INT8

Better accuracy

Hardware

H100, RTX 4090 only

vLLM support

✅ With --quantization fp8

Rule: If you have H100s, FP8 is the best inference format — better accuracy than INT8 at the same memory, with native hardware acceleration. On older hardware, fall back to INT8/INT4.
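The range difference between the two FP8 variants follows directly from their bit layouts. The sketch below derives the largest finite value for each, assuming E5M2 follows IEEE conventions (top exponent code reserved for inf/NaN) and E4M3 uses the common "finite-only" variant where just the all-ones pattern is NaN. The function name is illustrative.

```python
def fp8_max(exp_bits, man_bits, ieee_like):
    """Largest finite value of a small float format with the given layout."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (E5M2, and FP16 as a sanity check)
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2 ** -man_bits
    else:
        # Finite-only variant (E4M3): top exponent is usable, all-ones mantissa is NaN
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2 ** -(man_bits - 1)
    return max_mantissa * 2 ** max_exp


e4m3_max = fp8_max(4, 3, ieee_like=False)
e5m2_max = fp8_max(5, 2, ieee_like=True)
fp16_max = fp8_max(5, 10, ieee_like=True)

print("E4M3 max:", e4m3_max)
print("E5M2 max:", e5m2_max)
print("FP16 max:", fp16_max)
```

E4M3 trades range for an extra mantissa bit, which suits weights and activations; E5M2 keeps FP16's exponent range, which suits gradients.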

Advanced Quantization Methods

GPTQ

Post-Training · 4-bit / 8-bit

Generative Pre-trained Transformer Quantization. Uses second-order weight information (Hessians) to minimally perturb weights during quantization. Offline quantization — run once, load the result. Produces W4A16 models. Industry standard for 4-bit inference.

Quantization Time

Minutes to hours (offline)

Accuracy vs raw INT4

Significantly better

Calibration Data

~128 samples needed

Inference Engine

vLLM, TGI, AutoGPTQ

Pros

High accuracy preservation
Mature tooling (AutoGPTQ)
Pre-quantized models on HF Hub

Cons

Slow offline quantization
Less efficient than AWQ at inference time
Slightly worse perplexity than AWQ

AutoGPTQ TheBloke models vLLM supported

AWQ

Activation-Aware Weight Quantization

Key insight: Not all weights are equal. Weights on input channels with large activations cause disproportionate quantization error. AWQ identifies these "salient" channels from activation magnitudes on a small calibration set and protects them by scaling them up before quantization (with the inverse scale folded into the preceding operation), which avoids hardware-unfriendly mixed precision. It often yields better accuracy than GPTQ at the same bit-width.

vs GPTQ

Better perplexity

Inference Speed

Faster (efficient kernels)

Calibration

Small unlabeled dataset

Hardware

NVIDIA GPU required

Rule: AWQ is the preferred 4-bit format for production GPU serving. Use AutoAWQ for quantization. Load with vLLM or TGI. Better quality AND faster than GPTQ in most benchmarks.

Preferred for vLLM AutoAWQ Production Recommended
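A toy version of the AWQ scaling idea: scale up the weight on the high-activation channel before quantizing and divide the matching activation by the same factor, so the product is unchanged but the rounding error on that channel shrinks. The numbers, the fixed scale of 2, and the helper names are all illustrative; the real method searches for per-channel scales on calibration data.

```python
def quantize_int4(ws):
    """Symmetric INT4 fake-quantization of one weight group (returns floats)."""
    scale = max(abs(w) for w in ws) / 7
    return [max(-8, min(7, round(w / scale))) * scale for w in ws]


def dot(xs, ws):
    return sum(x * w for x, w in zip(xs, ws))


weights = [0.9, 0.31, -0.52, 0.13]
acts = [1.0, 20.0, 1.0, 1.0]          # channel 1 carries a large activation
channel_scale = [1.0, 2.0, 1.0, 1.0]  # protect only the salient channel

exact = dot(acts, weights)

# Plain round-to-nearest INT4
plain_err = abs(dot(acts, quantize_int4(weights)) - exact)

# AWQ-style: quantize (w * s), feed (x / s); mathematically the same product
scaled_w = [w * s for w, s in zip(weights, channel_scale)]
scaled_x = [x / s for x, s in zip(acts, channel_scale)]
awq_err = abs(dot(scaled_x, quantize_int4(scaled_w)) - exact)

print("output error, plain INT4:", plain_err)
print("output error, scaled    :", awq_err)
```

The scale works here because it does not change the group's maximum, so the quantization step stays the same while the salient weight's error is divided by the scale. A single example like this only illustrates the mechanism; the benefit in practice is statistical.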

GGUF

llama.cpp format · CPU/GPU

GGUF (GGML Unified Format) — the file format and quantization scheme used by llama.cpp. Supports a range of quantization levels: Q2_K, Q3_K_S/M/L, Q4_K_S/M, Q5_K_S/M, Q6_K, Q8_0. Uses k-quants with per-block scales and mixed precision within a block. CPU-first design.

CPU inference

✅ First-class

GPU offload

Partial layers

Best format

Q4_K_M (accuracy/speed)

Apple Silicon

✅ Metal acceleration

Q4_K_M is the sweet spot — best quality-to-size ratio for GGUF. Use Q5_K_M if you have extra RAM and want near-FP16 quality.

CPU Deployment Mac M-series llama.cpp / Ollama

bitsandbytes

8-bit / 4-bit · NF4

Tim Dettmers' library. 8-bit (LLM.int8()) uses vector-wise quantization with mixed-precision decomposition: outlier feature dimensions are computed in FP16, the rest in INT8. 4-bit uses NF4 (NormalFloat4), an information-theoretically optimal 4-bit type for normally distributed weights. Primary use case: QLoRA fine-tuning. For inference, AWQ/GPTQ have better kernel performance.

NF4 vs INT4

Better for normal dists

Primary Use

QLoRA training

Inference Speed

⚠️ Slower than AWQ

Double Quant

Quantize scale factors too

Warning: bitsandbytes is not optimized for pure inference throughput. Use it for QLoRA training or quick experiments. In production serving, prefer AWQ or GPTQ quantized models.
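The saving from double quantization is easy to work out by hand. The sketch below assumes the block sizes described in the QLoRA paper (64 weights per block with an FP32 scale, then 256 scales per second-level block quantized to 8 bits); the function names are illustrative.

```python
def scale_overhead_bits(block=64, scale_bits=32):
    """Extra bits per weight when each block stores one full-precision scale."""
    return scale_bits / block


def double_quant_overhead_bits(block=64, q_scale_bits=8, super_block=256, super_scale_bits=32):
    """Extra bits per weight when the scales are themselves quantized."""
    first_level = q_scale_bits / block                    # 8-bit scale per 64 weights
    second_level = super_scale_bits / (block * super_block)  # FP32 constant per 256 scales
    return first_level + second_level


plain = scale_overhead_bits()
double = double_quant_overhead_bits()
saving = plain - double

print(f"scale overhead without double quant: {plain:.3f} bits/weight")
print(f"scale overhead with double quant:    {double:.3f} bits/weight")
print(f"saving:                              {saving:.3f} bits/weight")
```

On a 70B model, a saving of roughly 0.37 bits per weight is about 3 GB, which is why the option is worth enabling for QLoRA.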

🔄 QLoRA — Training vs Inference Distinction

QLoRA training: Base model loaded in 4-bit NF4 (frozen). LoRA adapters trained in BF16. This dramatically reduces training VRAM. The 4-bit base weights are dequantized on the fly for the forward and backward passes but never updated; gradients flow only into the adapters.

After QLoRA training: You have a base 4-bit model + LoRA adapter weights in BF16. For inference, you can: (a) serve base model + merge adapters at runtime, or (b) dequantize → merge → re-quantize with AWQ/GPTQ for production. Option (b) is better for serving.

Critical: QLoRA-trained models in raw bitsandbytes format are NOT efficient for serving. Always merge and re-export for production.

🔧 SmoothQuant & ZeroQuant

SmoothQuant

Tackles the outlier problem in activations: some activation channels are 100× larger than others, making them hard to quantize. SmoothQuant mathematically migrates the quantization difficulty from activations to weights (which are easier to quantize) by introducing a per-channel scale factor. Enables W8A8 quantization with minimal accuracy loss.

W8A8 enabler · TensorRT-LLM
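The SmoothQuant migration can be sketched on a tiny example. Each input channel j gets a factor s_j = max|X_j|^α / max|W_j|^(1-α); activations are divided by it and the matching weight row is multiplied by it, leaving the matmul result unchanged. The matrices and α = 0.5 below are illustrative, and the helper names are not from any library.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def act_absmax(x):
    """Per input channel, i.e. per column of X."""
    return [max(abs(row[j]) for row in x) for j in range(len(x[0]))]


def weight_absmax(w):
    """Per input channel, i.e. per row of W."""
    return [max(abs(v) for v in row) for row in w]


def smooth(x, w, alpha=0.5):
    s = [a ** alpha / wm ** (1 - alpha) for a, wm in zip(act_absmax(x), weight_absmax(w))]
    x_s = [[row[j] / s[j] for j in range(len(s))] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(len(s))]
    return x_s, w_s


# Channel 1 has activations ~50-75x larger than the others
X = [[0.5, 40.0, -1.0], [-0.8, -60.0, 0.6]]
W = [[0.2, -0.4], [0.05, 0.03], [-0.3, 0.25]]

X_s, W_s = smooth(X, W)
before, after = act_absmax(X), act_absmax(X_s)
spread_before = max(before) / min(before)
spread_after = max(after) / min(after)

print("activation range spread before:", round(spread_before, 2))
print("activation range spread after: ", round(spread_after, 2))
print("outputs identical:", all(
    abs(p - q) < 1e-9 for r1, r2 in zip(matmul(X, W), matmul(X_s, W_s)) for p, q in zip(r1, r2)
))
```

After smoothing, the activation channels have comparable ranges, so a single INT8 scale per tensor or per token wastes far fewer levels on the outlier channel.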

ZeroQuant

Microsoft's method for W8A8 at the operator level. Uses token-wise quantization for activations and group-wise quantization for weights. Integrated into DeepSpeed. Supports INT4/INT8 with hardware-aware quantization kernels. Extended by ZeroQuant-V2 and ZeroQuant-FP.

DeepSpeed · W8A8

§ 03

Decision Guide · Quantization Decision Framework

Hardware → Format Rules

CPU only

→

GGUF Q4_K_M

via

llama.cpp / Ollama

Mac M1/M2/M3

→

GGUF Q4_K_M or Q5_K_M

via

llama.cpp (Metal)

RTX 3090 / 4090 (24GB)

→

AWQ 4-bit or FP16

via

vLLM

A100 / A6000 (40-80GB)

→

FP16 / BF16 or AWQ

via

vLLM

H100 (80GB)

→

FP8 E4M3

via

vLLM / TensorRT-LLM

Multi-GPU (2–8× A100)

→

FP16 + Tensor Parallel

via

vLLM (--tensor-parallel-size N)

Edge / Jetson / Mobile

→

INT4 or GGUF Q4

via

ONNX Runtime / llama.cpp

Use-Case Rules

High throughput API (>1000 req/s): FP8 on H100s with vLLM. Enable continuous batching + prefix caching. This is non-negotiable for cost efficiency at scale.

Latency-critical (<50ms TTFT): FP16 or BF16 — quantization usually doesn't help latency for single requests, only throughput. Prefill optimization matters more.

Fitting model on consumer GPU: AWQ 4-bit. A 70B model fits on 2× RTX 3090. A 13B model fits on a single RTX 3080 10GB. A 7B model on RTX 3060 12GB.

Fine-tuned model deployment: Merge LoRA → full model → quantize with AWQ → serve with vLLM. Never serve raw bitsandbytes models at scale.

Accuracy is critical (medical/legal): FP16 minimum. If you must quantize, use INT8 with SmoothQuant calibration and always run task-specific evals before deploying.

VLM (vision-language model): Quantize text decoder aggressively (INT4/AWQ), but be conservative with vision encoder (FP16 or INT8 at minimum). Vision encoder activations are more sensitive.

Model Size × VRAM Matrix

Model Size	FP16 VRAM	INT8 VRAM	INT4 VRAM	Min GPU (INT4)
3B	6 GB	3 GB	1.8 GB	RTX 3060 (8GB)
7B	14 GB	7 GB	4 GB	RTX 3060 12GB
13B	26 GB	13 GB	7 GB	RTX 3080 10GB
34B	68 GB	34 GB	17 GB	2× RTX 3090
70B	140 GB	70 GB	35 GB	2× A100 40GB / 2× RTX 3090
405B	810 GB	405 GB	202 GB	8× A100 80GB (INT4)

Note: Add ~20–30% for KV cache overhead at serving time. These are weights-only estimates.

§ 04

Reference · Hardware Compatibility Matrix

Hardware	Best Quant	FP8	INT8	INT4	GGUF	Notes
H100 80GB SXM	FP8 E4M3	✅	✅	✅	❌	Native FP8 Tensor Cores. Best-in-class inference. Use TensorRT-LLM or vLLM.
A100 80GB	FP16 / AWQ	⚠️	✅	✅	❌	No native FP8. BF16 Tensor Cores. Workhorse for production LLM serving.
RTX 4090 24GB	AWQ 4-bit	⚠️	✅	✅	✅	Ada Lovelace. Has FP8 Tensor Cores, but FP8 kernel support is less mature than on H100. Great value for single-GPU inference.
RTX 3090 24GB	AWQ 4-bit / INT8	❌	✅	✅	✅	Ampere. BF16 supported. Popular for local 70B (2×) or 13B single GPU serving.
RTX 3080 10GB	AWQ 4-bit	❌	⚠️	✅	✅	10GB limits to 7B-13B INT4. INT8 is tight even at 7B (~7 GB of weights); 13B INT8 (~13 GB) does not fit.
T4 16GB (Cloud)	INT8	❌	✅	✅	✅	Turing. No BF16. Common in GCP/AWS for cheap inference. FP16 inference.
V100 16/32GB	INT8	❌	✅	⚠️	❌	Volta. No BF16, no INT4 Tensor Cores. Use bitsandbytes INT8 carefully.
Apple M2/M3 (Metal)	GGUF Q4_K_M	❌	⚠️	✅	✅	Unified memory. llama.cpp with the Metal backend. Outstanding for local dev / 7B-13B models.
CPU (x86)	GGUF Q4_K_M	❌	⚠️	⚠️	✅	AVX2/AVX-512 for acceleration. llama.cpp. Slow but cost-effective for low traffic.
Jetson Orin (Edge)	INT4 / GGUF	❌	✅	✅	✅	Ampere GPU + ARM. Use TensorRT for INT8. GGUF Q4 for flexibility.

§ 05

Implementation · Quantization Methods: Code & Usage

5.1 bitsandbytes — 4-bit & 8-bit Loading

Python · bitsandbytes

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit NF4 (QLoRA style, for fine-tuning or quick experiments)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize and compute in BF16 during the forward pass
    bnb_4bit_use_double_quant=True,         # Quantize the scale factors too (saves ~0.4 bits per parameter)
)

# 8-bit INT8 (LLM.int8(): better accuracy than 4-bit)
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Activation outlier threshold: feature dimensions above it are computed in FP16
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # Gated repo: accept the license on the Hub first

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config_4bit,  # Swap in bnb_config_8bit for INT8
    device_map="auto",  # Automatically distributes layers across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

5.2 GPTQ — Offline Quantization

Python · AutoGPTQ

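from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B"
OUT_DIR = "./llama3-8b-gptq-4bit"

# Step 1: Quantize (run once offline, save result)
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # Smaller groups = better quality, slightly more VRAM (more scale factors)
    desc_act=False,  # False = faster inference; True = better accuracy
)

# Calibration set: replace this placeholder with ~128 representative text samples
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
calibration_texts = ["Quantization reduces the numerical precision of model weights."]
calibration_examples = [tokenizer(text) for text in calibration_texts]

model = AutoGPTQForCausalLM.from_pretrained(MODEL_ID, quantize_config)
model.quantize(calibration_examples)  # List of tokenized samples (input_ids + attention_mask)
model.save_quantized(OUT_DIR)
tokenizer.save_pretrained(OUT_DIR)

# Step 2: Load the quantized model for inference
model = AutoGPTQForCausalLM.from_quantized(
    OUT_DIR,
    device_map="auto",
    use_triton=True,  # Faster Triton kernels if available
)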

5.3 AWQ — Activation-Aware Quantization

Python · AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B"
OUT_DIR = "./llama3-8b-awq-4bit"

# Quantize offline (AutoAWQ falls back to its default calibration set unless you pass your own)
model = AutoAWQForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(OUT_DIR)
tokenizer.save_pretrained(OUT_DIR)  # Keep the tokenizer next to the weights so serving engines can load the directory

# Load in vLLM (preferred serving path)
# vllm serve ./llama3-8b-awq-4bit --quantization awq

5.4 GGUF — Convert & Run on CPU

Bash · llama.cpp conversion

# Step 1: Clone and build llama.cpp (CMake build; older checkouts also supported `make -j8`)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j 8

# Step 2: Convert HF model to GGUF F16
python convert_hf_to_gguf.py /path/to/llama3-8b-hf --outtype f16 --outfile llama3-8b-f16.gguf

# Step 3: Quantize to Q4_K_M (recommended); the CMake build places binaries in build/bin/
./build/bin/llama-quantize llama3-8b-f16.gguf llama3-8b-q4km.gguf Q4_K_M

# Step 4: Run inference (CPU)
./build/bin/llama-cli -m llama3-8b-q4km.gguf -p "Explain quantization in one paragraph:" -n 256

# With partial GPU offload (e.g. 20 layers to GPU)
./build/bin/llama-cli -m llama3-8b-q4km.gguf -ngl 20 -p "Hello"

# Serve as OpenAI-compatible API
./build/bin/llama-server -m llama3-8b-q4km.gguf --port 8080

5.5 FP8 with vLLM

Bash · vLLM FP8 (H100 required)

# Static FP8 quantization with llm-compressor. The library is usually driven from Python
# (oneshot() with an FP8 recipe), so verify this module path against your installed version.
python -m llmcompressor.transformers.compression.compress \
  --model meta-llama/Meta-Llama-3-70B \
  --recipe fp8_recipe.yaml \
  --save_dir ./llama3-70b-fp8

# Or: dynamic FP8 (no calibration needed, slightly lower quality)
vllm serve meta-llama/Meta-Llama-3-70B \
  --dtype float16 \
  --quantization fp8 \
  --tensor-parallel-size 4

§ 06

Critical Section · Inference Engines

⚡

vLLM

UC Berkeley · Production GPU LLM Serving

The default choice for production GPU inference. Built around PagedAttention for near-zero KV cache waste, continuous batching, and a growing list of quantization support. OpenAI-compatible API out of the box. Python-native, easy to deploy.

Strengths

PagedAttention (near-zero memory waste)
Continuous batching
AWQ, GPTQ, FP8, INT8 support
Tensor parallelism built-in
OpenAI API compatible

Weaknesses

No CPU inference
NVIDIA GPU only (AMD experimental)
Less tunable than TensorRT
Not ideal for very small models

FP16 / BF16 · AWQ · GPTQ · FP8 · INT8

🚀

TensorRT-LLM

NVIDIA · Maximum GPU Throughput

NVIDIA's production-grade engine. Compiles models into TensorRT graphs with fused kernels, optimal memory layouts, and hardware-specific optimizations. Highest raw throughput on NVIDIA hardware, but complex to set up and inflexible (model format locked to TRT).

Strengths

Maximum throughput on NVIDIA
FP8, INT8, SmoothQuant native
Speculative decoding, LoRA
Production-tested by NVIDIA

Weaknesses

Complex build pipeline
NVIDIA-only, CUDA-locked
Slow model compilation step
Less flexible than vLLM

FP8 · INT8 · SmoothQuant · NVIDIA Only

🦙

llama.cpp

Georgi Gerganov · CPU-First Inference

The CPU inference champion. Pure C++ with AVX/AVX2/AVX-512 and Apple Metal support. GGUF format with k-quants (Q2_K through Q8_0). Supports partial GPU offload. Default choice for local/CPU/edge deployments. Powers Ollama, LM Studio, Jan.

Strengths

CPU and Apple Silicon native
Partial GPU offload
OpenAI API via llama-server
Wide model support

Weaknesses

Not for high-throughput GPU serving
GGUF format only
Continuous batching in llama-server is basic compared with vLLM/TGI

🤗

HuggingFace Transformers

Experimentation & Fine-tuning

The research and experimentation standard. Not optimized for production throughput but unmatched in flexibility and model coverage. Use for: fine-tuning, custom architectures, research, quick testing. Integrates bitsandbytes, GPTQ, and AWQ via quantization configs. Not for production serving at scale.

Production warning: HF Transformers has no continuous batching, poor memory management, and naive sequential batching. At scale (>10 req/s), switch to vLLM or TGI.

🏭

TGI (Text Generation Inference)

HuggingFace · Production Serving

HuggingFace's production inference server. Continuous batching, tensor parallelism, GPTQ/AWQ support. Slightly less throughput than vLLM on benchmarks but very well-integrated with HF Hub models. Good choice if your stack is HF-centric.

FP16 · GPTQ · AWQ · HF Hub native

🔷

ONNX Runtime

Microsoft · Cross-Platform Edge/Cloud

Export-compile-deploy pipeline. Best for: cross-platform deployment (Windows, iOS, Android, WebAssembly), small models, edge devices. Supports INT8 quantization and DirectML for non-NVIDIA GPUs. Not ideal for large autoregressive LLMs — optimized for encoder models and smaller decoders.

Edge Devices · INT8 · DirectML

🌊

DeepSpeed Inference

Microsoft · Multi-GPU Large Models

Microsoft's inference optimization library. Kernel injection for fused ops, ZeroQuant INT4/INT8 quantization, and tensor parallelism. Best for very large models (>100B) on multi-GPU clusters where model doesn't fit in standard configurations. Also used for DeepSpeed-Chat serving pipeline.

100B+ Models · ZeroQuant · Multi-GPU

§ 07

Reference Matrix · Engine × Quantization Compatibility

Engine	FP32	FP16/BF16	INT8	INT4	FP8	GPTQ	AWQ	GGUF	NF4 (bnb)
vLLM	❌	✅	✅	✅	✅	✅	✅	❌	⚠️
TensorRT-LLM	❌	✅	✅	✅	✅	✅	✅	❌	❌
llama.cpp	❌	⚠️	⚠️	✅	❌	❌	❌	✅	❌
HF Transformers	✅	✅	✅	✅	⚠️	✅	✅	❌	✅
TGI	❌	✅	✅	✅	⚠️	✅	✅	❌	❌
ONNX Runtime	✅	✅	✅	⚠️	❌	❌	❌	❌	❌
DeepSpeed	✅	✅	✅	✅	⚠️	❌	❌	❌	❌

✅ Native support ⚠️ Partial / experimental ❌ Not supported

§ 08

Critical Differences · LLM vs VLM Quantization

Why VLM Quantization is Harder

Vision-Language Models (LLaVA, Qwen-VL, InternVL, Idefics, etc.) consist of at minimum two components: a vision encoder (ViT-based) and a text decoder (transformer LLM). These have fundamentally different quantization characteristics.

🖼️ Vision Encoder

Processes raw pixel values → patch embeddings
Activations have high variance and extreme outliers
Many operations are not matmul-dominated (Conv2d, LayerNorm)
INT4 causes severe visual distortion artifacts
FP16 or INT8 with careful calibration is the minimum
CLIP encoders particularly sensitive

📝 Text Decoder

Standard autoregressive transformer
Can be aggressively quantized (INT4/AWQ)
Larger proportion of model VRAM
Standard LLM quantization techniques apply
AWQ or GPTQ 4-bit works well here

Critical rule: Never apply INT4 to vision encoders in production. The quantization error propagates into multimodal embeddings and causes hallucinations that are hard to detect — the model may confidently describe a completely different image.

Recommended VLM Quantization Strategy

Vision Encoder

→

Keep FP16 or INT8 with SmoothQuant

Cross-Attention / MLP Projector

→

INT8 W8A16 (conservative)

Text Decoder LLM

→

AWQ 4-bit or INT8

Mixed precision total

→

~60% memory savings vs full FP16
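The mixed-precision total can be estimated per component. The sketch below uses a hypothetical VLM split (0.4B vision encoder, 0.05B projector, 7B text decoder); the sizes are placeholders, so substitute the parameter counts of your own model.

```python
def total_gb(components):
    """components: list of (name, params_in_billions, bytes_per_param)."""
    return sum(params_b * bytes_per_param for _, params_b, bytes_per_param in components)


# Hypothetical component sizes; the precision choices follow the strategy above
mixed = [
    ("vision encoder", 0.4, 2.0),   # kept in FP16
    ("projector", 0.05, 1.0),       # INT8 weights
    ("text decoder", 7.0, 0.5),     # AWQ 4-bit
]
fp16 = [(name, params_b, 2.0) for name, params_b, _ in mixed]

savings = 1 - total_gb(mixed) / total_gb(fp16)

print(f"full FP16:       {total_gb(fp16):.2f} GB")
print(f"mixed precision: {total_gb(mixed):.2f} GB")
print(f"savings:         {savings:.0%}")
```

Because the text decoder dominates the parameter count, its precision sets the overall saving: with a 4-bit decoder this example lands near 70%, and with an INT8 decoder it drops below 50%.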

Common VLM Quantization Failure Modes

Failure Mode	Root Cause	Fix
Wrong object identification	Vision encoder INT4 causes embedding distortion	Upgrade vision encoder to FP16
Color/count errors	Patch embedding quantization error	Use INT8 for vision encoder, calibrate
Coherent text but wrong image	Projector layer quantized aggressively	Keep projector in FP16
Increased hallucination rate	Multimodal embedding space corrupted	Run visual QA benchmarks (VQAv2) post-quant
Slow inference despite quantization	Mixed precision dequantization overhead	Profile with nsight; batch vision pre-processing

§ 09

End-to-End · Deployment Pipelines

Pipeline A: GPTQ → vLLM Production Serving

1

Prepare Calibration Dataset

Select ~128 representative text samples from your use case. Domain-specific data gives better calibration than generic data.

2

Run GPTQ Quantization

Quantize with the AutoGPTQ Python API (section 5.2) using bits=4 and group_size=128, and save the result to ./llama3-70b-gptq4b. Takes 1–4 hours on A100.

3

Validate Perplexity

Measure WikiText-2 perplexity of quantized vs original. Accept if delta <0.5 for most models. Run task-specific evals (HellaSwag, MMLU) for accuracy-critical apps.

4

Serve with vLLM

vllm serve ./llama3-70b-gptq4b --quantization gptq --tensor-parallel-size 2 --port 8000

5

Load Test

Use locust or wrk to verify throughput under expected load. Monitor GPU memory and token/sec.
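For the perplexity validation in step 3, the metric itself is just the exponential of the mean negative log-likelihood per token. The sketch below shows the calculation and the delta check on made-up log-probabilities; in practice the log-probs come from running both models over the same evaluation text.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Illustrative per-token log-probs for the same text under each model
fp16_logprobs = [-1.2, -0.4, -2.1, -0.9]
quant_logprobs = [-1.3, -0.5, -2.2, -1.0]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_quant = perplexity(quant_logprobs)
delta = ppl_quant - ppl_fp16

print(f"FP16 perplexity:      {ppl_fp16:.3f}")
print(f"quantized perplexity: {ppl_quant:.3f}")
print(f"delta: {delta:.3f} -> {'accept' if delta < 0.5 else 'investigate'}")
```

The 0.5 threshold is the rule of thumb from step 3; tighten it for accuracy-critical applications and always pair it with task-level evals.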

Pipeline B: GGUF for Local CPU/Mac Deployment

1

Download HF Model

huggingface-cli download meta-llama/Llama-3-8b-hf --local-dir ./llama3-8b

2

Convert to GGUF FP16

python llama.cpp/convert_hf_to_gguf.py ./llama3-8b --outtype f16 --outfile llama3-8b-f16.gguf

3

Quantize to Q4_K_M

./llama-quantize llama3-8b-f16.gguf llama3-8b-q4km.gguf Q4_K_M. ~5GB output for 8B model.

4

Serve with llama-server

./llama-server -m llama3-8b-q4km.gguf -c 4096 --port 8080 -ngl 99 (ngl=layers on GPU, 99=all)

Pipeline C: INT8 + TensorRT-LLM on H100

1

Calibrate with SmoothQuant

Run SmoothQuant calibration to migrate outliers from activations to weights. Requires 512 calibration samples.

2

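Convert Checkpoint

python convert_checkpoint.py --model_dir llama3-70b --output_dir ./trt_ckpts --dtype float16 --use_smooth_quant --int8_kv_cache (SmoothQuant flag names differ between TensorRT-LLM releases; check the example README for your version)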

3

Compile with trtllm-build

trtllm-build --checkpoint_dir ./trt_ckpts --output_dir ./trt_engines --gemm_plugin float16. This step takes 20–60 min.

4

Deploy via Triton Server

Serve TRT engine via NVIDIA Triton Inference Server with TensorRT-LLM backend for production scale.

§ 10

Performance · Optimization Techniques

📦 KV Cache Optimization

KV cache = the biggest VRAM consumer after weights. For 70B, FP16 KV cache can exceed model weights at long contexts.

PagedAttention (vLLM): Non-contiguous KV blocks, near-zero waste, enables sharing for parallel sampling
8-bit KV cache: --kv-cache-dtype fp8 (or fp8_e5m2 / fp8_e4m3) in vLLM. 50% KV cache memory reduction, <0.5% quality loss
KV cache quantization: per-token scale factors for activations; more accurate than per-tensor
Context length management: Set --max-model-len conservatively. KV cache memory grows linearly with context length and batch size (it is attention compute during prefill that grows quadratically)
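KV cache size can be estimated from the model shape. The sketch below assumes a Llama-3-70B-like configuration (80 layers, 8 KV heads with grouped-query attention, head dimension 128); treat those numbers as assumptions and read the real values from your model's config.

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_elem=2):
    """KV cache size in GB (1 GB = 1e9 bytes). The leading 2 covers keys and values."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
    return total_bytes / 1e9


# Assumed 70B-class shape; not read from an actual checkpoint
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

one_seq_8k = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, seq_len=8192)
one_seq_16k = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, seq_len=16384)
batch32_8k = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, seq_len=8192, batch_size=32)
batch32_8k_fp8 = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, seq_len=8192, batch_size=32, bytes_per_elem=1)

print(f"1 sequence  @ 8k,  FP16: {one_seq_8k:.2f} GB")
print(f"1 sequence  @ 16k, FP16: {one_seq_16k:.2f} GB")
print(f"32 sequences @ 8k, FP16: {batch32_8k:.1f} GB")
print(f"32 sequences @ 8k, 8-bit: {batch32_8k_fp8:.1f} GB")
```

Doubling the context doubles the cache, and with a few dozen concurrent long sequences the cache alone approaches the size of 4-bit or 8-bit weights, which is when an 8-bit KV cache pays off.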

🔄 Continuous Batching

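The key throughput unlock. Unlike static batching (wait for a full batch, then hold every slot until the longest sequence finishes), continuous batching schedules at the iteration level: a new request takes over a slot as soon as any sequence in the batch completes.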

Eliminates idle GPU time between requests
Supported in vLLM, TGI, TensorRT-LLM natively
3–10× throughput improvement over naive batching
Critical for mixed-length workloads
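A toy simulation shows why mixed-length workloads benefit most. It counts decode steps needed to finish a queue of requests with a fixed number of batch slots, under the simplifying assumption that one decode step costs the same regardless of how many slots are active. The request lengths are invented.

```python
import heapq


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """A waiting request takes over a slot the moment that slot's sequence finishes."""
    slots = [0] * batch_size  # step at which each slot becomes free
    heapq.heapify(slots)
    for n in lengths:
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + n)
    return max(slots)


# Output lengths (in tokens) for eight queued requests; two are much longer than the rest
lengths = [10, 200, 15, 20, 180, 12, 25, 30]

static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)

print("static batching:    ", static_steps, "decode steps")
print("continuous batching:", continuous_steps, "decode steps")
```

In the static case the short requests in each batch sit idle behind the long one; with continuous batching the short requests cycle through the remaining slots while the long ones run.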

⏱️ Prefill vs Decode Optimization

Two distinct phases with different bottlenecks:

Prefill: compute-bound (processes entire prompt in one pass). Optimize with chunked prefill to overlap with decode.
Decode: memory-bandwidth bound (one token at a time). Optimize with batching (more sequences = better GPU utilization) and quantization.
Chunked prefill (vLLM v1): Break long prompts into chunks, interleave with decode steps to avoid decode stalls.

🎯 Speculative Decoding

Use a small draft model (e.g. 1B) to speculatively generate 3–5 tokens, then verify with the target model in a single forward pass. Effective for latency-bound workloads.

Reduces decode latency 1.5–2.5× for appropriate tasks
No accuracy loss (mathematically equivalent)
Supported in vLLM, TensorRT-LLM
Works best for predictable output distributions
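The speedup range can be reasoned about with the standard expected-acceptance estimate: if each draft token is accepted independently with probability α and the draft proposes γ tokens, one target pass yields (1 − α^(γ+1)) / (1 − α) tokens on average. The sketch below applies that formula; the acceptance rate and the draft-to-target cost ratio are assumed values for illustration.

```python
def expected_tokens_per_step(alpha, gamma):
    """Average tokens produced per target-model forward pass."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha, gamma, draft_cost_ratio):
    """Speedup over plain decoding; draft_cost_ratio = draft step time / target step time."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost_ratio + 1)


# Assumed: 70% acceptance, 4 draft tokens, draft model 10x cheaper per step
tokens = expected_tokens_per_step(alpha=0.7, gamma=4)
speedup = estimated_speedup(alpha=0.7, gamma=4, draft_cost_ratio=0.1)

print(f"tokens per target pass: {tokens:.2f}")
print(f"estimated speedup:      {speedup:.2f}x")
```

The estimate makes the "predictable output" caveat concrete: when the acceptance rate falls, tokens per pass approaches 1 and the draft model's cost turns the speedup into a slowdown.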

Prefix Caching (APC)

Store and reuse KV cache for repeated prompt prefixes (system prompts, RAG context). In vLLM, Automatic Prefix Caching (APC) is enabled by default in v1. Eliminates redundant prefill computation for requests sharing a long common prefix. For chatbots with a fixed system prompt, this can reduce effective compute by 40–60%.

Production rule: Always use APC in vLLM. Structure your prompts to front-load the shared prefix (system prompt + RAG docs) before the user message. Ensure prompt stability — even small changes break cache hits.
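Why a small change at the front of the prompt breaks cache hits can be seen in a block-hash sketch. This is loosely modeled on block-level prefix caching (each block's hash depends on all tokens before it); it is an illustration with hypothetical names and token IDs, not vLLM's actual implementation.

```python
import hashlib


def block_hashes(token_ids, block_size=4):
    """Hash each full block together with the hash of everything before it."""
    hashes, prev = [], b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


def cached_prefix_tokens(cache, token_ids, block_size=4):
    """Number of leading tokens whose KV blocks are already in the cache."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break
        hits += 1
    return hits * block_size


system_prompt = list(range(100, 112))        # 12 shared tokens (made-up IDs)
first_request = system_prompt + [1, 2, 3, 4, 5]
cache = set(block_hashes(first_request))

same_prefix = system_prompt + [9, 8, 7]
dynamic_front = [999] + system_prompt + [9, 8, 7]  # e.g. a timestamp placed first

hits_same = cached_prefix_tokens(cache, same_prefix)
hits_dynamic = cached_prefix_tokens(cache, dynamic_front)

print("reused tokens, stable prefix:        ", hits_same)
print("reused tokens, dynamic content first:", hits_dynamic)
```

A single extra token at the front shifts every block, so nothing is reused; the same content appended after the shared prefix leaves all prefix blocks intact.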

§ 11

Troubleshooting · Debugging & Failure Modes

Issue	Likely Cause	Diagnosis	Fix
Significant accuracy drop after INT4	Model too small, or high-variance activations	Compare perplexity before/after. Run task evals.	Switch to INT8 or AWQ with smaller group size (g=64)
OOM during quantization	Full model loaded in FP16 before quantization	nvidia-smi, torch memory profiler	Quantize layer-by-layer, use CPU offload during GPTQ
OOM during serving	KV cache overflows, batch too large	Check vLLM logs for KV cache usage	Reduce --max-model-len, add --gpu-memory-utilization 0.85, use an FP8 KV cache (--kv-cache-dtype fp8)
Slow inference despite GPU	Memory bandwidth saturation, small batch size	Profile with nsight; check tokens/sec per GPU	Increase batch size, enable tensor parallelism, check for CPU-GPU transfer bottlenecks
Repetitive or incoherent output	Quantization degraded attention mechanisms	Compare outputs vs FP16 on same prompts	Increase precision (INT8→FP16), check if attention layers are quantized
GGUF slower than expected on CPU	Missing AVX2/AVX-512, wrong thread count	Check the system_info line llama.cpp prints at startup for AVX2/AVX512 flags	Rebuild with `-DGGML_AVX2=ON` (CMake), set `-t` to the number of physical cores
VLM seeing wrong image content	Vision encoder over-quantized	Test with known images, compare to FP16 baseline	Keep vision encoder in FP16, only quantize text decoder
vLLM prefix cache miss rate high	Prompt instability, timestamp/dynamic content in prefix	Check vLLM metrics endpoint for cache hit rate	Move dynamic content to end of prompt, stabilize system prompt
TensorRT build fails	CUDA version mismatch, unsupported op	TRT build logs, check CUDA compute capability	Match TRT version to CUDA driver, verify model is in supported list

§ 12

Applied Examples · Real-World Scenarios

🖥️ Low-Cost Chatbot on CPU CPU / Edge

Model

Llama-3-8B or Mistral-7B

Quantization

GGUF Q4_K_M

Engine

llama.cpp / Ollama

Deploy on a $5/month VPS (4 cores, 8GB RAM) or even a local Mac Mini. ~2–5 tokens/sec on CPU. Suitable for <5 concurrent users. Ollama gives a one-command server. Total infra cost: near zero.

~5GB RAM · No GPU · $0 GPU cost

⚡ High-Throughput API (1000+ req/s) Production Scale

Model

Llama-3-70B or Mixtral 8×7B

Quantization

FP8 (H100) or AWQ 4-bit (A100)

Engine

vLLM + tensor parallelism

8× H100 cluster with vLLM in FP8. Continuous batching + prefix caching. FP8 KV cache. Target: 5,000 tokens/sec/GPU. Use a load balancer across multiple vLLM instances. Horizontal scaling with Kubernetes. With FP8 on H100: ~40% cost reduction vs FP16 on A100s.

Bash

vllm serve meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 8 \
  --dtype float16 \
  --quantization fp8 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

📷 Multi-Camera VLM System VLM Production

Model

InternVL-2-26B or LLaVA-1.6

Quantization

Vision: FP16, Text: AWQ 4-bit

Engine

vLLM (VLM support) or custom

Smart factory use case: 8 cameras feed images to VLM for defect detection. Batch images from multiple cameras into a single forward pass. Keep vision encoder in FP16 — critical for defect pixel-level accuracy. Use text decoder in AWQ 4-bit to fit on 2× RTX 3090. Pre-process images asynchronously to hide latency.

Key insight: For manufacturing/defect applications, never quantize the vision encoder below INT8. False negatives in defect detection are critical failures.

🔧 Edge Deployment (Industrial IoT) Edge

Hardware

Jetson Orin or Raspberry Pi 5

Model

Qwen2.5-1.5B or Phi-3-mini

Quantization

INT4 (TensorRT on Jetson) / GGUF Q4

On-device inference with no cloud dependency. Jetson Orin: use TensorRT INT8 for maximum throughput (supports 7B models at ~10 tokens/sec). Raspberry Pi 5: GGUF Q4 on CPU via llama.cpp — 1–3B models only at ~1–3 tokens/sec. Functional for offline industrial QA assistants.

§ 13

High Value · Practical Tips

🎯

AWQ > GPTQ for new deployments. Unless you have a specific reason (e.g., pre-quantized GPTQ model already available), AWQ produces better perplexity and has faster inference kernels. The gap is ~0.3–0.8 perplexity points in favor of AWQ at 4-bit.

💡

Calibration data matters more than you think. GPTQ/AWQ calibration data should match your use case. Medical LLM? Calibrate on medical text. Code assistant? Use code samples. Domain mismatch in calibration can cost 1–2% task accuracy.

⚠️

Never benchmark tokens/sec on a single request. That measures decode speed, not serving throughput. Use a realistic concurrency level (10–100 simultaneous requests). vLLM's throughput benchmark: python benchmarks/benchmark_throughput.py

🔢

An 8-bit KV cache is almost always free money. Enable --kv-cache-dtype fp8 in vLLM. 50% KV cache memory reduction, <0.3% perplexity increase. This lets you serve longer contexts or more concurrent requests without touching model weights.

📊

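For 70B+ models: don't try to squeeze onto fewer GPUs than needed. Running a 70B AWQ model on 2× RTX 3090s barely works: ~35 GB of weights leaves little of the 48 GB for KV cache, and PCIe bandwidth becomes the bottleneck for tensor-parallel traffic. 2× A100 40GB is far more efficient, with 80 GB of total VRAM and much higher memory bandwidth.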

🚀

Speculative decoding with a 3B draft and a 70B target is often the lowest-latency setup for latency-constrained serving. Draft with a fast 3B model, verify with the 70B. The output matches what the 70B alone would produce, at roughly 1.5–2.5× lower decode latency for many query types.

🔄

Prefix caching is your cheapest optimization. For RAG systems with large context documents, APC eliminates re-prefilling the same document chunks repeatedly. Can improve effective throughput by 2–4× in RAG workloads. Structure: [system prompt][retrieved docs][user query].

💰

The 10× cost reduction formula: FP16 on A100 → AWQ 4-bit on 2× RTX 4090. Near-identical model quality (typically 1–3% degradation, so validate on your task), same (or better) throughput, and up to ~10× cheaper GPU rental cost. This is the most practical cost lever available today.

🧪

Always run task-specific evals, not just perplexity. A model can have good perplexity but poor accuracy on your specific task after quantization. For classification tasks, quantization error tends to affect boundary cases first.

🏗️

For fine-tuned models: train QLoRA → merge → re-quantize with AWQ. Never serve the raw QLoRA model with bitsandbytes. The merged+AWQ path gives 2–3× better throughput. The extra conversion step is worth it for any production deployment.

§ 14

Reference Code · Code Snippets

vLLM — Production Inference

Python · vLLM

from vllm import LLM, SamplingParams

# Load AWQ 4-bit model
llm = LLM(
    model="./llama3-8b-awq-4bit",
    quantization="awq",
    tensor_parallel_size=1,        # Number of GPUs
    gpu_memory_utilization=0.90,   # Reserve 10% for safety
    enable_prefix_caching=True,    # APC - huge win for RAG
    kv_cache_dtype="fp8_e5m2",     # FP8 (E5M2) KV cache
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# Batched inference (always batch for throughput)
prompts = ["Tell me about LLM quantization", "What is AWQ?"]
outputs = llm.generate(prompts, params)

for output in outputs:
    print(output.outputs[0].text)

vLLM OpenAI-Compatible API Server

Bash · vLLM Server

# Start server (AWQ model, 2 GPUs, prefix caching enabled)
# --served-model-name sets the name clients pass as "model";
# without it the server registers the model under the path given to `vllm serve`.
vllm serve ./llama3-70b-awq \
  --served-model-name llama3-70b-awq \
  --quantization awq \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 16384 \
  --host 0.0.0.0 \
  --port 8000

# Query the OpenAI-compatible endpoint with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-70b-awq", "messages": [{"role": "user", "content": "Hello"}]}'
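
The same request can be sent from Python with only the standard library. This is a sketch with hypothetical helper names; the `model` value must match the name the server lists at /v1/models.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, max_tokens=128):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_message):
    # Sends the request; requires a running server at base_url.
    req = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", "llama3-70b-awq", "Hello")` returns the assistant's reply as a string.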

Check Model Memory Requirements Before Loading

Python · Memory Estimator

def estimate_vram(param_billions, dtype_bytes=2, kv_overhead=1.3):
    """
    Rough VRAM estimate in decimal GB (1 GB = 1e9 bytes).

    param_billions: model size (e.g. 7 for 7B)
    dtype_bytes: 2=FP16, 1=INT8, 0.5=INT4
    kv_overhead: 1.2-1.4x multiplier for KV cache + activations
    """
    weights_gb = (param_billions * 1e9 * dtype_bytes) / 1e9
    return weights_gb * kv_overhead

# Examples
print(f"70B FP16: {estimate_vram(70, 2):.1f} GB")    # 182.0 GB
print(f"70B INT8: {estimate_vram(70, 1):.1f} GB")    # 91.0 GB
print(f"70B INT4: {estimate_vram(70, 0.5):.1f} GB")  # 45.5 GB
print(f"8B INT4:  {estimate_vram(8, 0.5):.1f} GB")   # 5.2 GB

§ 15

Final Reference · Comparison Summary

Quantization Methods — Final Comparison

Method	Bits	Accuracy	Inference Speed	CPU	Primary Use	Tooling
FP16/BF16	16	🟢 Baseline	🟡 Baseline	❌	Default GPU serving	All frameworks
FP8 E4M3	8	🟢 ~FP16	🟢 ~2× FP16	❌	H100 production	vLLM, TensorRT
INT8 (W8A8)	8	🟢 <1% loss	🟢 1.5–2×	⚠️	Bandwidth-limited	bitsandbytes, TRT
AWQ 4-bit	4	🟡 1–2% loss	🟢 3–4×	❌	Production GPU 4-bit	AutoAWQ, vLLM
GPTQ 4-bit	4	🟡 1–2.5% loss	🟡 2–3×	❌	4-bit GPU inference	AutoGPTQ, TGI
GGUF Q4_K_M	~4.5	🟡 ~1.5% loss	🟡 CPU-optimized	✅	CPU/local deployment	llama.cpp, Ollama
NF4 (bnb)	4	🟡 1–2% loss	🔴 Slow kernels	❌	QLoRA training only	bitsandbytes

Inference Engine — Final Comparison

Engine	Best For	Throughput	Ease of Use	GPU Required	CPU
vLLM	Production GPU serving	🟢 Excellent	🟢 Easy	✅	❌
TensorRT-LLM	Max throughput NVIDIA	🟢 Best	🔴 Complex	✅	❌
llama.cpp	CPU / local / edge	🟡 CPU-limited	🟢 Easy	Optional	✅
HF Transformers	Research / fine-tuning	🔴 Poor	🟢 Best	Optional	✅
TGI	HF-integrated production	🟡 Good	🟢 Good	✅	❌
ONNX Runtime	Cross-platform / edge	🟡 Medium	🟡 Medium	Optional	✅
DeepSpeed	100B+ models	🟡 Good	🔴 Complex	✅	❌

Recommended Stack Per Use Case

Use Case	Model Size	Quantization	Engine	Hardware
Local developer assistant	7–13B	GGUF Q4_K_M	Ollama	Mac M2+ / Any CPU
Team internal API	7–34B	AWQ 4-bit	vLLM	1–2× RTX 3090/4090
Public API (<100 req/s)	7–70B	AWQ 4-bit	vLLM	2–4× A100 40GB
High-scale API (>1000 req/s)	70B	FP8	vLLM / TensorRT	8× H100
Fine-tuned model (QLoRA)	7–13B	QLoRA train → AWQ serve	vLLM	1× RTX 3090+ train, A100 serve
RAG application	7–34B	AWQ 4-bit + APC	vLLM (prefix cache ON)	1–2× A100
VLM production	7–26B	Vision: FP16, Text: AWQ	vLLM (VLM) / custom	2–4× A100
Edge / IoT	1–7B	INT4 / GGUF Q4	TensorRT / llama.cpp	Jetson Orin / ARM
Experimentation	Any	bitsandbytes 4-bit	HF Transformers	Any GPU

🎯 The Production Decision in Two Steps

Have GPU? YES → H100? YES → FP8 + vLLM

Have GPU? YES → H100? NO → AWQ 4-bit + vLLM

Have GPU? NO → GGUF Q4_K_M + llama.cpp

In 90% of cases, this two-step decision covers the optimal choice. Layer on refinements (speculative decoding, KV cache tuning, SmoothQuant) only after validating this baseline works for your requirements.
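
The same decision can be written as a small function, for example in a deployment script. The function name and return shape are illustrative.

```python
def recommend_stack(has_gpu, is_h100=False):
    """Return (quantization, engine) for the two-step baseline decision."""
    if not has_gpu:
        return ("GGUF Q4_K_M", "llama.cpp")
    if is_h100:
        return ("FP8", "vLLM")
    return ("AWQ 4-bit", "vLLM")


print(recommend_stack(has_gpu=True, is_h100=True))   # ('FP8', 'vLLM')
print(recommend_stack(has_gpu=True))                 # ('AWQ 4-bit', 'vLLM')
print(recommend_stack(has_gpu=False))                # ('GGUF Q4_K_M', 'llama.cpp')
```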
