Quantization is the process of reducing the numerical precision of model weights and/or activations from high-precision floating point (FP32/FP16) to lower-precision formats (INT8, INT4, FP8). The goal is to reduce model memory footprint and accelerate matrix multiply operations — the dominant operation in transformer inference — at the cost of some representational accuracy.
Fundamentally: instead of storing a weight as a 32-bit float, you store it as an 8-bit or 4-bit integer, then scale it back during computation using a stored scale factor.
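The store-then-rescale idea can be sketched in a few lines of plain Python. This is a minimal illustration of symmetric quantization with one scale factor per list of weights, not the scheme any particular library uses; the function names and sample weights are made up for the example.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers using one scale factor for the whole list."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats: w is roughly q * scale."""
    return [v * scale for v in q]


weights = [0.12, -0.50, 0.33, 0.01, -0.27]   # illustrative values only
q8, s8 = quantize_symmetric(weights, bits=8)
q4, s4 = quantize_symmetric(weights, bits=4)

# Worst-case reconstruction error is bounded by half a quantization step.
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))
print("INT8:", q8, "max error", err8)
print("INT4:", q4, "max error", err4)
```

Fewer bits mean a larger step size, so the INT4 reconstruction error is visibly larger than the INT8 one. Practical schemes reduce this by using one scale per group of weights rather than one per tensor.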
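A 70B model in FP16 needs ~140 GB of VRAM for weights alone. In INT4 that drops to ~35 GB, which fits across 2× RTX 3090 (48 GB total) or, with very little headroom left for the KV cache, on a single 40 GB A100. Without quantization, large models are simply unrunnable on most hardware.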
LLM token generation (the decode phase) is memory-bandwidth bound, not compute bound. Smaller dtypes = less data moved per token = higher throughput. INT8 can be up to 2× faster than FP16 on bandwidth-saturated hardware.
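A rough way to see why dtype size translates into speed: if every generated token has to stream all weights from memory once, bandwidth divided by weight size gives an upper bound on single-sequence decode speed. The sketch below assumes exactly that simplified model; the 1000 GB/s bandwidth figure is a hypothetical round number, not a measurement of any GPU.

```python
def decode_tokens_per_sec(param_billions, bytes_per_param, bandwidth_gb_s):
    """Upper bound for single-sequence decode: each token reads all weights once."""
    weight_gb = param_billions * bytes_per_param   # (1e9 params * bytes) / 1e9
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 1000.0  # hypothetical memory bandwidth, for illustration only

for name, nbytes in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    bound = decode_tokens_per_sec(7, nbytes, BANDWIDTH_GB_S)
    print(f"7B {name}: at most ~{bound:.0f} tokens/sec per sequence")
```

Real throughput lands below this bound because of KV cache reads, dequantization overhead, and kernel efficiency, but the proportionality (half the bytes, roughly twice the ceiling) is the point.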
Smaller models fit on cheaper GPUs. A model that required an A100 in FP16 might run on an RTX 4090 in INT4. Cost per token can drop 5–10×. Critical for API-serving businesses.
| Lower Precision → | Memory | Latency / Throughput | Accuracy |
|---|---|---|---|
| FP32 | 🔴 4× baseline | 🔴 Slowest | 🟢 Full accuracy |
| FP16 / BF16 | 🟡 2× baseline | 🟡 ~2× faster | 🟢 Negligible loss |
| INT8 | 🟢 1× baseline | 🟢 2–3× faster | 🟡 <1% degradation |
| INT4 | 🟢 0.5× baseline | 🟢 3–4× faster | 🟡 1–3% degradation |
| Technique | What it does | Accuracy Cost | Best For |
|---|---|---|---|
| Quantization | Reduces precision of weights/activations | Low (0.5–3%) | Inference speed, VRAM reduction |
| Pruning | Removes unimportant weights (sets to 0) | Medium (can be high) | Structured sparsity, specialized hardware |
| Distillation | Trains smaller model to mimic larger one | Depends on size gap | Permanent smaller model, edge deployment |
| Combined | QAT + pruning + distillation | Controllable | Maximum compression with accuracy budget |
- No accuracy loss vs FP32
- Native GPU support everywhere
- Default for HF Transformers
- Can overflow (max 65504)
- Less stable for training
- At 2 bytes per parameter (half of FP32), memory is still large
- Strong accuracy preservation
- Widely supported
- Excellent for production APIs
- W8A8 needs calibration data
- Some activations hard to quantize
- Outlier problem (see SmoothQuant)
- High accuracy preservation
- Mature tooling (AutoGPTQ)
- Pre-quantized models on HF Hub
- Slow offline quantization
- Less efficient than AWQ at inference time
- Slightly worse perplexity than AWQ
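QLoRA training: The base model is loaded in 4-bit NF4 and kept frozen; LoRA adapters are trained in BF16. This dramatically reduces training VRAM. The 4-bit weights are dequantized to BF16 on the fly for the forward and backward passes, so gradients flow through the base model but only the adapter weights are updated.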
After QLoRA training: You have a base 4-bit model + LoRA adapter weights in BF16. For inference, you can: (a) serve base model + merge adapters at runtime, or (b) dequantize → merge → re-quantize with AWQ/GPTQ for production. Option (b) is better for serving.
SmoothQuant
Tackles the outlier problem in activations: some activation channels are 100× larger than others, making them hard to quantize. SmoothQuant mathematically migrates the quantization difficulty from activations to weights (which are easier to quantize) by introducing a per-channel scale factor. Enables W8A8 quantization with minimal accuracy loss.
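The migration step can be illustrated with a tiny example. The sketch below uses the per-channel scale s_j = max|X_j|^α / max|W_j|^(1-α) with α = 0.5, divides the activations by s and multiplies the matching weight rows by s, then checks that the matrix product is unchanged. The matrices are made-up values with one deliberate outlier channel, and the helper names are illustrative rather than taken from any SmoothQuant implementation.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smooth(x, w, alpha=0.5):
    """Rescale input channel j by 1/s_j in x and by s_j in w, so x @ w is unchanged."""
    n_in = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(n_in)]
    w_max = [max(abs(v) for v in w[j]) for j in range(n_in)]
    # Assumes every channel has a nonzero maximum in both x and w.
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_in)]
    x_s = [[row[j] / s[j] for j in range(n_in)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(n_in)]
    return x_s, w_s, s


def channel_spread(x):
    """Ratio between the largest and smallest per-channel absolute maximum."""
    maxima = [max(abs(row[j]) for row in x) for j in range(len(x[0]))]
    return max(maxima) / min(maxima)


# Two tokens, three input channels; channel 1 is the outlier channel.
x = [[0.5, 80.0, -0.3], [-0.4, -100.0, 0.2]]
w = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.25]]

x_s, w_s, s = smooth(x, w)
print("activation spread before:", channel_spread(x))
print("activation spread after: ", channel_spread(x_s))
print("outputs match:", matmul(x, w), matmul(x_s, w_s))
```

After smoothing, the activation channels span a much narrower range, so a single INT8 scale wastes far fewer levels on the outlier channel, while the weights absorb the difference.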
ZeroQuant
Microsoft's method for W8A8 at the operator level. Uses token-wise quantization for activations and group-wise quantization for weights. Integrated into DeepSpeed. Supports INT4/INT8 with hardware-aware quantization kernels. Extended by the follow-up ZeroQuant-V2 and ZeroQuant-FP work.
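To see why token-wise activation scales help, the sketch below compares one scale for the whole activation tensor with one scale per token (row). It only illustrates the granularity idea and is not DeepSpeed's API; the activation values and function names are invented for the example, and rows are assumed to be non-zero.

```python
def quant_dequant(values, scale, qmax=127):
    """Round to INT8 levels with the given scale, then map back to floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor(acts, qmax=127):
    """One scale shared by every token."""
    scale = max(abs(v) for row in acts for v in row) / qmax
    return [quant_dequant(row, scale, qmax) for row in acts]


def per_token(acts, qmax=127):
    """One scale per token (row), computed from that row's own maximum."""
    out = []
    for row in acts:
        scale = max(abs(v) for v in row) / qmax
        out.append(quant_dequant(row, scale, qmax))
    return out


def row_error(orig, approx):
    return max(abs(a - b) for a, b in zip(orig, approx))


# Token 0 has small activations, token 1 has large ones.
acts = [[0.02, -0.05, 0.03], [40.0, -25.0, 60.0]]
pt = per_tensor(acts)
tok = per_token(acts)
print("token 0 error, per-tensor scale:", row_error(acts[0], pt[0]))
print("token 0 error, per-token scale: ", row_error(acts[0], tok[0]))
```

With a shared scale, the small-magnitude token is rounded to zero; with its own scale it keeps nearly all of its information, while the large token is quantized identically in both cases.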
Hardware → Format Rules
| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM | Min GPU (INT4) |
|---|---|---|---|---|
| 3B | 6 GB | 3 GB | 1.8 GB | RTX 3060 (8GB) |
| 7B | 14 GB | 7 GB | 4 GB | RTX 3060 12GB |
| 13B | 26 GB | 13 GB | 7 GB | RTX 3080 10GB |
| 34B | 68 GB | 34 GB | 17 GB | 2× RTX 3090 |
| 70B | 140 GB | 70 GB | 35 GB | 2× A100 40GB / 2× RTX 3090 |
| 405B | 810 GB | 405 GB | 202 GB | 8× A100 80GB (INT4) |
| Hardware | Best Quant | FP8 | INT8 | INT4 | GGUF | Notes |
|---|---|---|---|---|---|---|
| H100 80GB SXM | FP8 E4M3 | ✅ | ✅ | ✅ | ❌ | Native FP8 Tensor Cores. Best-in-class inference. Use TensorRT-LLM or vLLM. |
| A100 80GB | FP16 / AWQ | ⚠️ | ✅ | ✅ | ❌ | No native FP8. BF16 Tensor Cores. Workhorse for production LLM serving. |
| RTX 4090 24GB | AWQ 4-bit | ⚠️ | ✅ | ✅ | ✅ | Ada Lovelace. FP8 Tensor Cores are present, but framework support is less mature than on H100. Great value for single-GPU inference. |
| RTX 3090 24GB | AWQ 4-bit / INT8 | ❌ | ✅ | ✅ | ✅ | Ampere. BF16 supported. Popular for local 70B (2×) or 13B single GPU serving. |
| RTX 3080 10GB | AWQ 4-bit | ❌ | ⚠️ | ✅ | ✅ | 10GB limits to 7B-13B INT4. Tight for INT8 on 13B. |
| T4 16GB (Cloud) | INT8 | ❌ | ✅ | ✅ | ✅ | Turing. No BF16, so use FP16 for unquantized inference. Common in GCP/AWS for cheap inference. |
| V100 16/32GB | INT8 | ❌ | ✅ | ⚠️ | ❌ | Volta. No BF16, no INT4 Tensor Cores. Use bitsandbytes INT8 carefully. |
| Apple M2/M3 (Metal) | GGUF Q4_K_M | ❌ | ⚠️ | ✅ | ✅ | Unified memory. llama.cpp with the Metal backend. Outstanding for local dev / 7B-13B models. |
| CPU (x86) | GGUF Q4_K_M | ❌ | ⚠️ | ⚠️ | ✅ | AVX2/AVX-512 for acceleration. llama.cpp. Slow but cost-effective for low traffic. |
| Jetson Orin (Edge) | INT4 / GGUF | ❌ | ✅ | ✅ | ✅ | Ampere GPU + ARM. Use TensorRT for INT8. GGUF Q4 for flexibility. |
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 (QLoRA style: for fine-tuning or quick experiments)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 for normally-distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16 during forward pass
    bnb_4bit_use_double_quant=True,         # Quantize scale factors too (~0.4 bits/param saved)
)

# 8-bit INT8 (better for inference accuracy)
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Outlier threshold: activation features above it are computed in FP16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config_4bit,  # or bnb_config_8bit
    device_map="auto",                    # Automatically distributes across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
```
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"

# Step 1: Quantize (run once offline, save result)
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # Smaller = better quality, slightly more VRAM
    desc_act=False,  # False = faster inference; True = better accuracy
)

# Calibration data: a list of tokenized text samples (~128 in practice).
# The single sentence below is a placeholder; use representative text.
tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration_examples = [
    tokenizer("Quantization reduces the numerical precision of model weights.")
]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(calibration_examples)
model.save_quantized("./llama3-8b-gptq-4bit")

# Step 2: Load the quantized model for inference
model = AutoGPTQForCausalLM.from_quantized(
    "./llama3-8b-gptq-4bit",
    device_map="auto",
    use_triton=True,  # Faster Triton kernels if available
)
```
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Quantize offline
model = AutoAWQForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

# Save weights and tokenizer together so the directory can be served directly
model.save_quantized("./llama3-8b-awq-4bit")
tokenizer.save_pretrained("./llama3-8b-awq-4bit")

# Load in vLLM (preferred serving path)
# vllm serve ./llama3-8b-awq-4bit --quantization awq
```
```bash
# Step 1: Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j8
# or: cmake -B build && cmake --build build --config Release

# Step 2: Convert HF model to GGUF F16
python convert_hf_to_gguf.py /path/to/llama3-8b-hf --outtype f16 --outfile llama3-8b-f16.gguf

# Step 3: Quantize to Q4_K_M (recommended)
./llama-quantize llama3-8b-f16.gguf llama3-8b-q4km.gguf Q4_K_M

# Step 4: Run inference (CPU)
./llama-cli -m llama3-8b-q4km.gguf -p "Explain quantization in one paragraph:" -n 256

# With partial GPU offload (e.g. 20 layers to GPU)
./llama-cli -m llama3-8b-q4km.gguf -ngl 20 -p "Hello"

# Serve as OpenAI-compatible API
./llama-server -m llama3-8b-q4km.gguf --port 8080
```
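```bash
# Static FP8 quantization via vLLM's quantization toolkit (llm-compressor).
# NOTE: this CLI form is reproduced as given; llm-compressor is usually driven
# through its Python oneshot API, so verify the entry point for your version.
python -m llmcompressor.transformers.compression.compress \
    --model meta-llama/Meta-Llama-3-70B \
    --recipe fp8_recipe.yaml \
    --save_dir ./llama3-70b-fp8

# Or: dynamic FP8 (no calibration needed, slightly lower quality)
vllm serve meta-llama/Meta-Llama-3-70B \
    --dtype float16 \
    --quantization fp8 \
    --tensor-parallel-size 4
```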
vLLM
The default choice for production GPU inference. Built around PagedAttention for near-zero KV cache waste, continuous batching, and a growing list of supported quantization formats. OpenAI-compatible API out of the box. Python-native, easy to deploy.
- PagedAttention (near-zero memory waste)
- Continuous batching
- AWQ, GPTQ, FP8, INT8 support
- Tensor parallelism built-in
- OpenAI API compatible
- No CPU inference
- NVIDIA GPU only (AMD experimental)
- Less tunable than TensorRT
- Not ideal for very small models
TensorRT-LLM
NVIDIA's production-grade engine. Compiles models into TensorRT graphs with fused kernels, optimal memory layouts, and hardware-specific optimizations. Highest raw throughput on NVIDIA hardware, but complex to set up and inflexible (model format locked to TRT).
- Maximum throughput on NVIDIA
- FP8, INT8, SmoothQuant native
- Speculative decoding, LoRA
- Production-tested by NVIDIA
- Complex build pipeline
- NVIDIA-only, CUDA-locked
- Slow model compilation step
- Less flexible than vLLM
llama.cpp
The CPU inference champion. Pure C++ with AVX/AVX2/AVX-512 and Apple Metal support. GGUF format with quantization levels from Q2_K through Q8_0 (including the k-quants). Supports partial GPU offload. Default choice for local/CPU/edge deployments. Powers Ollama, LM Studio, Jan.
- CPU and Apple Silicon native
- Partial GPU offload
- OpenAI API via llama-server
- Wide model support
- Not for high-throughput GPU serving
- GGUF format only
- Batching and scheduling in llama-server are less capable than in vLLM or TGI
HF Transformers
The research and experimentation standard. Not optimized for production throughput but unmatched in flexibility and model coverage. Use for: fine-tuning, custom architectures, research, quick testing. Integrates bitsandbytes, GPTQ, and AWQ via quantization configs. Not for production serving at scale.
TGI (Text Generation Inference)
HuggingFace's production inference server. Continuous batching, tensor parallelism, GPTQ/AWQ support. Slightly less throughput than vLLM on benchmarks but very well-integrated with HF Hub models. Good choice if your stack is HF-centric.
ONNX Runtime
Export-compile-deploy pipeline. Best for: cross-platform deployment (Windows, iOS, Android, WebAssembly), small models, edge devices. Supports INT8 quantization and DirectML for non-NVIDIA GPUs. Not ideal for large autoregressive LLMs; it is optimized for encoder models and smaller decoders.
DeepSpeed
Microsoft's inference optimization library. Kernel injection for fused ops, ZeroQuant INT4/INT8 quantization, and tensor parallelism. Best for very large models (>100B) on multi-GPU clusters where the model doesn't fit in standard configurations. Also used for the DeepSpeed-Chat serving pipeline.
| Engine | FP32 | FP16/BF16 | INT8 | INT4 | FP8 | GPTQ | AWQ | GGUF | NF4 (bnb) |
|---|---|---|---|---|---|---|---|---|---|
| vLLM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ⚠️ |
| TensorRT-LLM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| llama.cpp | ❌ | ⚠️ | ⚠️ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| HF Transformers | ✅ | ✅ | ✅ | ✅ | ⚠️ | ✅ | ✅ | ❌ | ✅ |
| TGI | ❌ | ✅ | ✅ | ✅ | ⚠️ | ✅ | ✅ | ❌ | ❌ |
| ONNX Runtime | ✅ | ✅ | ✅ | ⚠️ | ❌ | ❌ | ❌ | ❌ | ❌ |
| DeepSpeed | ✅ | ✅ | ✅ | ✅ | ⚠️ | ❌ | ❌ | ❌ | ❌ |
✅ Native support ⚠️ Partial / experimental ❌ Not supported
Vision-Language Models (LLaVA, Qwen-VL, InternVL, Idefics, etc.) consist of at minimum two components: a vision encoder (ViT-based) and a text decoder (transformer LLM). These have fundamentally different quantization characteristics.
- Processes raw pixel values → patch embeddings
- Activations have high variance and extreme outliers
- Many operations are not matmul-dominated (Conv2d, LayerNorm)
- INT4 causes severe visual distortion artifacts
- FP16 or INT8 with careful calibration is the minimum
- CLIP encoders particularly sensitive
- Standard autoregressive transformer
- Can be aggressively quantized (INT4/AWQ)
- Larger proportion of model VRAM
- Standard LLM quantization techniques apply
- AWQ or GPTQ 4-bit works well here
| Failure Mode | Root Cause | Fix |
|---|---|---|
| Wrong object identification | Vision encoder INT4 causes embedding distortion | Upgrade vision encoder to FP16 |
| Color/count errors | Patch embedding quantization error | Use INT8 for vision encoder, calibrate |
| Coherent text but wrong image | Projector layer quantized aggressively | Keep projector in FP16 |
| Increased hallucination rate | Multimodal embedding space corrupted | Run visual QA benchmarks (VQAv2) post-quant |
| Slow inference despite quantization | Mixed precision dequantization overhead | Profile with nsight; batch vision pre-processing |
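The fixes in the table amount to a per-component precision plan: keep the vision encoder and projector in FP16 and quantize only the text decoder. A minimal sketch of such a plan as a prefix lookup is shown below. The module-name prefixes (`vision_tower.`, `multi_modal_projector.`, `language_model.`) and the precision labels are assumptions for illustration; real checkpoints may name their submodules differently.

```python
# Ordered (prefix, precision) rules; first match wins. Names are hypothetical.
PRECISION_RULES = [
    ("vision_tower.", "fp16"),            # vision encoder: sensitive to quantization
    ("multi_modal_projector.", "fp16"),   # projector: keep in FP16
    ("language_model.", "awq-int4"),      # text decoder: tolerates 4-bit
]


def plan_precision(module_names, rules=PRECISION_RULES, default="fp16"):
    """Assign a precision label to each module by its name prefix."""
    plan = {}
    for name in module_names:
        plan[name] = next(
            (precision for prefix, precision in rules if name.startswith(prefix)),
            default,  # anything unrecognized stays in FP16 to be safe
        )
    return plan


modules = [
    "vision_tower.encoder.layers.0.self_attn.q_proj",
    "multi_modal_projector.linear_1",
    "language_model.model.layers.0.mlp.down_proj",
    "some_other_head.weight",
]
for name, precision in plan_precision(modules).items():
    print(f"{precision:9s} {name}")
```

Defaulting unknown modules to FP16 is a deliberately conservative choice: a missed rule costs some memory rather than accuracy.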
GPTQ 4-bit served with vLLM:
1. Quantize: `auto-gptq quantize --bits 4 --group-size 128 --model llama3-70b --output ./llama3-70b-gptq4b`. Takes 1–4 hours on A100.
2. Serve: `vllm serve ./llama3-70b-gptq4b --quantization gptq --tensor-parallel-size 2 --port 8000`
3. Load test: use locust or wrk to verify throughput under expected load. Monitor GPU memory and tokens/sec.

GGUF served with llama.cpp:
1. Download: `huggingface-cli download meta-llama/Meta-Llama-3-8B --local-dir ./llama3-8b`
2. Convert: `python llama.cpp/convert_hf_to_gguf.py ./llama3-8b --outtype f16 --outfile llama3-8b-f16.gguf`
3. Quantize: `./llama-quantize llama3-8b-f16.gguf llama3-8b-q4km.gguf Q4_K_M`. ~5 GB output for an 8B model.
4. Serve: `./llama-server -m llama3-8b-q4km.gguf -c 4096 --port 8080 -ngl 99` (ngl = layers on GPU, 99 = all)

TensorRT-LLM with SmoothQuant:
1. Convert checkpoint: `python convert_checkpoint.py --model_dir llama3-70b --dtype float16 --use_smooth_quant --int8_kv_cache`
2. Build engine: `trtllm-build --checkpoint_dir ./trt_ckpts --output_dir ./trt_engines --gemm_plugin float16`. This step takes 20–60 min.

The KV cache is the biggest VRAM consumer after the weights. For 70B, the FP16 KV cache can exceed the model weights at long contexts and large batch sizes.
- PagedAttention (vLLM): Non-contiguous KV blocks, near-zero waste, enables sharing for parallel sampling
- 8-bit KV cache: --kv-cache-dtype fp8 (or fp8_e5m2) in vLLM. 50% KV cache memory reduction, <0.5% quality loss
- KV cache quantization granularity: per-token scale factors are more accurate than a single per-tensor scale
- Context length management: Set --max-model-len conservatively. KV cache memory grows linearly with context length and batch size; attention compute grows quadratically
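KV cache size follows directly from the model shape: two tensors (K and V) per layer, each holding `num_kv_heads × head_dim` values per token. The sketch below computes it in decimal GB. The example configuration (80 layers, 8 KV heads, head dimension 128) is what I understand Llama-3-70B's grouped-query attention layout to be; treat it as an assumption and read the real values from your model's config.

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len,
                batch_size=1, bytes_per_elem=2):
    """KV cache size in decimal GB: K and V for every layer, head, and token."""
    elems = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elems * bytes_per_elem / 1e9


# Assumed 70B-class GQA shape (verify against the model's config.json).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for batch in (1, 8, 32):
    fp16 = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, seq_len=8192, batch_size=batch)
    eight_bit = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, seq_len=8192,
                            batch_size=batch, bytes_per_elem=1)
    print(f"batch {batch:2d} @ 8192 tokens: FP16 {fp16:6.1f} GB, 8-bit {eight_bit:6.1f} GB")
```

The formula is linear in both sequence length and batch size, which is why concurrent long-context requests, rather than a single request, are what push the cache past the size of the weights.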
The key throughput unlock. Unlike static batching (wait for all requests, process together), continuous batching slots in new requests as soon as GPU compute is free.
- Eliminates idle GPU time between requests
- Supported in vLLM, TGI, TensorRT-LLM natively
- 3–10× throughput improvement over naive batching
- Critical for mixed-length workloads
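A toy simulation makes the difference concrete. It counts decode steps under two policies, assuming every step costs the same regardless of how many slots are occupied and ignoring prefill; both are simplifications, and the request lengths are invented. It is a sketch of the scheduling idea, not of any engine's scheduler.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes; short ones sit idle."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Free slots are refilled from the queue at the start of every step."""
    queue = deque(lengths)
    active = []          # remaining tokens for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # Every active request emits one token; finished requests leave.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps


# Mixed-length workload: output lengths in tokens, 4 batch slots.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print("static:", static_steps, "steps; continuous:", continuous_steps, "steps")
```

In the static case each batch is held hostage by its 100-token request, while the continuous scheduler starts the second long request as soon as the short ones finish, so the two long requests overlap almost entirely.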
Two distinct phases with different bottlenecks:
- Prefill: compute-bound (processes entire prompt in one pass). Optimize with chunked prefill to overlap with decode.
- Decode: memory-bandwidth bound (one token at a time). Optimize with batching (more sequences = better GPU utilization) and quantization.
- Chunked prefill (vLLM v1): Break long prompts into chunks, interleave with decode steps to avoid decode stalls.
Use a small draft model (e.g. 1B) to speculatively generate 3–5 tokens, then verify with the target model in a single forward pass. Effective for latency-bound workloads.
- Reduces decode latency 1.5–2.5× for appropriate tasks
- No accuracy loss (mathematically equivalent)
- Supported in vLLM, TensorRT-LLM
- Works best for predictable output distributions
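The draft-then-verify loop can be sketched with two toy "models" that are just deterministic functions of the context. This is the greedy variant: draft tokens are accepted while they match what the target would have chosen, and the first mismatch is replaced by the target's token. Sampling-based speculative decoding uses a rejection-sampling rule instead, which this sketch does not implement. All names and the toy models are illustrative assumptions.

```python
def target_next(ctx):
    """Toy target model: next token is a deterministic function of the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx):
    """Toy draft model: agrees with the target except at every 4th position."""
    t = target_next(ctx)
    return t if len(ctx) % 4 else (t + 1) % 50


def greedy_decode(prompt, n):
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n, k=4):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2. Target checks all k positions (one batched forward pass in a real engine).
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)   # replace the first mismatch and stop
                break
        else:
            accepted.append(target_next(out + accepted))  # all matched: bonus token
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n], target_passes


prompt = [1, 2, 3]
tokens, passes = speculative_decode(prompt, n=20)
print("same output as plain greedy decoding:", tokens == greedy_decode(prompt, 20))
print("target passes:", passes, "instead of 20")
```

Every emitted token is one the target model would have produced itself, so the output is unchanged; the saving comes from needing fewer target passes than tokens whenever the draft guesses correctly.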
Store and reuse KV cache for repeated prompt prefixes (system prompts, RAG context). In vLLM, Automatic Prefix Caching (APC) is enabled by default in v1. Eliminates redundant prefill computation for requests sharing a long common prefix. For chatbots with a fixed system prompt, this can reduce effective compute by 40–60%.
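The reuse logic can be sketched as a block-level lookup keyed by the full token prefix. This is a simplified model of the idea, not vLLM's implementation: real engines hash blocks, store the actual KV tensors, and evict under memory pressure. The class name, block size, and token IDs below are illustrative.

```python
class PrefixCache:
    """Toy prefix cache: remembers which block-aligned prefixes have been computed."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = set()

    def process(self, tokens):
        """Return (cached_tokens, computed_tokens) for one request."""
        cached = 0
        n_full = len(tokens) // self.block_size * self.block_size
        hit_run = True
        for end in range(self.block_size, n_full + 1, self.block_size):
            key = tuple(tokens[:end])   # a block is identified by its whole prefix
            if hit_run and key in self.blocks:
                cached = end            # KV for these tokens can be reused
            else:
                hit_run = False         # first miss ends the reusable prefix
                self.blocks.add(key)
        return cached, len(tokens) - cached


cache = PrefixCache(block_size=4)
system_prompt = list(range(12))                      # stand-in for a fixed system prompt

first = cache.process(system_prompt + [100, 101, 102])
second = cache.process(system_prompt + [200, 201])
shifted = cache.process([999] + system_prompt + [200])  # dynamic token at the front

print("first request  (cached, computed):", first)
print("second request (cached, computed):", second)
print("dynamic prefix (cached, computed):", shifted)
```

The third request shows why dynamic content at the start of a prompt defeats the cache: a single different leading token changes every block key that follows it.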
| Issue | Likely Cause | Diagnosis | Fix |
|---|---|---|---|
| Significant accuracy drop after INT4 | Model too small, or high-variance activations | Compare perplexity before/after. Run task evals. | Switch to INT8 or AWQ with smaller group size (g=64) |
| OOM during quantization | Full model loaded in FP16 before quantization | nvidia-smi, torch memory profiler | Quantize layer-by-layer, use CPU offload during GPTQ |
| OOM during serving | KV cache overflows, batch too large | Check vLLM logs for KV cache usage | Reduce --max-model-len, add --gpu-memory-utilization 0.85, use INT8 KV cache |
| Slow inference despite GPU | Memory bandwidth saturation, small batch size | Profile with nsight; check tokens/sec per GPU | Increase batch size, enable tensor parallelism, check for CPU-GPU transfer bottlenecks |
| Repetitive or incoherent output | Quantization degraded attention mechanisms | Compare outputs vs FP16 on same prompts | Increase precision (INT8→FP16), check if attention layers are quantized |
| GGUF slower than expected on CPU | Missing AVX2/AVX-512, wrong thread count | ./llama-cli --help check BLAS backend | Compile with GGML_AVX2=ON, set -t to physical cores |
| VLM seeing wrong image content | Vision encoder over-quantized | Test with known images, compare to FP16 baseline | Keep vision encoder in FP16, only quantize text decoder |
| vLLM prefix cache miss rate high | Prompt instability, timestamp/dynamic content in prefix | Check vLLM metrics endpoint for cache hit rate | Move dynamic content to end of prompt, stabilize system prompt |
| TensorRT build fails | CUDA version mismatch, unsupported op | TRT build logs, check CUDA compute capability | Match TRT version to CUDA driver, verify model is in supported list |
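Several rows above recommend comparing perplexity before and after quantization. Perplexity is the exponential of the mean negative log-likelihood per token, so given per-token log-probabilities from each model on the same text, the comparison is a few lines. The log-probability values below are placeholders to make the sketch runnable, not measurements from any model.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over all tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(baseline_ppl, quantized_ppl):
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


# Placeholder per-token log-probs on the same evaluation text (illustrative only).
fp16_logprobs = [-1.20, -0.35, -2.10, -0.80, -1.55]
int4_logprobs = [-1.25, -0.40, -2.20, -0.82, -1.60]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_int4 = perplexity(int4_logprobs)
print(f"FP16 perplexity: {ppl_fp16:.3f}")
print(f"INT4 perplexity: {ppl_int4:.3f}")
print(f"relative increase: {relative_increase(ppl_fp16, ppl_int4):.1%}")
```

Perplexity is a coarse signal; a small increase can still hide regressions on specific tasks, which is why the table pairs it with task evals.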
Deploy on a $5/month VPS (4 cores, 8GB RAM) or even a local Mac Mini. ~2–5 tokens/sec on CPU. Suitable for <5 concurrent users. Ollama gives a one-command server. Total infra cost: near zero.
8× H100 cluster with vLLM in FP8. Continuous batching + prefix caching. FP8 KV cache. Target: 5,000 tokens/sec/GPU. Use a load balancer across multiple vLLM instances. Horizontal scaling with Kubernetes. With FP8 on H100: ~40% cost reduction vs FP16 on A100s.
```bash
vllm serve meta-llama/Meta-Llama-3-70B \
    --tensor-parallel-size 8 \
    --dtype float16 \
    --quantization fp8 \
    --enable-prefix-caching \
    --kv-cache-dtype fp8_e5m2 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90
```
Smart factory use case: 8 cameras feed images to VLM for defect detection. Batch images from multiple cameras into a single forward pass. Keep vision encoder in FP16 — critical for defect pixel-level accuracy. Use text decoder in AWQ 4-bit to fit on 2× RTX 3090. Pre-process images asynchronously to hide latency.
On-device inference with no cloud dependency. Jetson Orin: use TensorRT INT8 for maximum throughput (supports 7B models at ~10 tokens/sec). Raspberry Pi 5: GGUF Q4 on CPU via llama.cpp — 1–3B models only at ~1–3 tokens/sec. Functional for offline industrial QA assistants.
Benchmark throughput with `python benchmarks/benchmark_throughput.py` (from the vLLM repository).

8-bit KV cache: `--kv-cache-dtype fp8` in vLLM. 50% KV cache memory reduction, <0.3% perplexity increase. This lets you serve longer contexts or more concurrent requests without touching model weights.

```python
from vllm import LLM, SamplingParams

# Load AWQ 4-bit model
llm = LLM(
    model="./llama3-8b-awq-4bit",
    quantization="awq",
    tensor_parallel_size=1,        # Number of GPUs
    gpu_memory_utilization=0.90,   # Reserve 10% for safety
    enable_prefix_caching=True,    # APC: huge win for RAG
    kv_cache_dtype="fp8_e5m2",     # 8-bit (FP8) KV cache
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# Batched inference (always batch for throughput)
prompts = ["Tell me about LLM quantization", "What is AWQ?"]
outputs = llm.generate(prompts, params)
for output in outputs:
    print(output.outputs[0].text)
```
```bash
# Start server (AWQ model, 2 GPUs, prefix caching enabled)
# --served-model-name sets the name clients pass in the "model" field
vllm serve ./llama3-70b-awq \
    --served-model-name llama3-70b-awq \
    --quantization awq \
    --tensor-parallel-size 2 \
    --enable-prefix-caching \
    --kv-cache-dtype fp8_e5m2 \
    --max-model-len 16384 \
    --host 0.0.0.0 \
    --port 8000

# Query with a standard OpenAI-style request
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3-70b-awq", "messages": [{"role": "user", "content": "Hello"}]}'
```
```python
def estimate_vram(param_billions, dtype_bytes=2, kv_overhead=1.3):
    """
    param_billions: model size (e.g. 7 for 7B)
    dtype_bytes: 2=FP16, 1=INT8, 0.5=INT4
    kv_overhead: 1.2-1.4x for KV cache + activations
    """
    # Decimal GB (1e9 bytes), matching the sizing tables
    weights_gb = (param_billions * 1e9 * dtype_bytes) / 1e9
    return weights_gb * kv_overhead


# Examples
print(f"70B FP16: {estimate_vram(70, 2):.1f} GB")    # ~182 GB
print(f"70B INT8: {estimate_vram(70, 1):.1f} GB")    # ~91 GB
print(f"70B INT4: {estimate_vram(70, 0.5):.1f} GB")  # ~45 GB
print(f"8B INT4: {estimate_vram(8, 0.5):.1f} GB")    # ~5.2 GB
```
| Method | Bits | Accuracy | Inference Speed | CPU | Primary Use | Tooling |
|---|---|---|---|---|---|---|
| FP16/BF16 | 16 | 🟢 Baseline | 🟡 Baseline | ❌ | Default GPU serving | All frameworks |
| FP8 E4M3 | 8 | 🟢 ~FP16 | 🟢 ~2× FP16 | ❌ | H100 production | vLLM, TensorRT |
| INT8 (W8A8) | 8 | 🟢 <1% loss | 🟢 1.5–2× | ⚠️ | Bandwidth-limited | bitsandbytes, TRT |
| AWQ 4-bit | 4 | 🟡 1–2% loss | 🟢 3–4× | ❌ | Production GPU 4-bit | AutoAWQ, vLLM |
| GPTQ 4-bit | 4 | 🟡 1–2.5% loss | 🟡 2–3× | ❌ | 4-bit GPU inference | AutoGPTQ, TGI |
| GGUF Q4_K_M | ~4.5 | 🟡 ~1.5% loss | 🟡 CPU-optimized | ✅ | CPU/local deployment | llama.cpp, Ollama |
| NF4 (bnb) | 4 | 🟡 1–2% loss | 🔴 Slow kernels | ❌ | QLoRA training only | bitsandbytes |
| Engine | Best For | Throughput | Ease of Use | GPU Required | CPU |
|---|---|---|---|---|---|
| vLLM | Production GPU serving | 🟢 Excellent | 🟢 Easy | ✅ | ❌ |
| TensorRT-LLM | Max throughput NVIDIA | 🟢 Best | 🔴 Complex | ✅ | ❌ |
| llama.cpp | CPU / local / edge | 🟡 CPU-limited | 🟢 Easy | Optional | ✅ |
| HF Transformers | Research / fine-tuning | 🔴 Poor | 🟢 Best | Optional | ✅ |
| TGI | HF-integrated production | 🟡 Good | 🟢 Good | ✅ | ❌ |
| ONNX Runtime | Cross-platform / edge | 🟡 Medium | 🟡 Medium | Optional | ✅ |
| DeepSpeed | 100B+ models | 🟡 Good | 🔴 Complex | ✅ | ❌ |
| Use Case | Model Size | Quantization | Engine | Hardware |
|---|---|---|---|---|
| Local developer assistant | 7–13B | GGUF Q4_K_M | Ollama | Mac M2+ / Any CPU |
| Team internal API | 7–34B | AWQ 4-bit | vLLM | 1–2× RTX 3090/4090 |
| Public API (<100 req/s) | 7–70B | AWQ 4-bit | vLLM | 2–4× A100 40GB |
| High-scale API (>1000 req/s) | 70B | FP8 | vLLM / TensorRT | 8× H100 |
| Fine-tuned model (QLoRA) | 7–13B | QLoRA train → AWQ serve | vLLM | 1× RTX 3090+ train, A100 serve |
| RAG application | 7–34B | AWQ 4-bit + APC | vLLM (prefix cache ON) | 1–2× A100 |
| VLM production | 7–26B | Vision: FP16, Text: AWQ | vLLM (VLM) / custom | 2–4× A100 |
| Edge / IoT | 1–7B | INT4 / GGUF Q4 | TensorRT / llama.cpp | Jetson Orin / ARM |
| Experimentation | Any | bitsandbytes 4-bit | HF Transformers | Any GPU |
In 90% of cases, this two-step decision covers the optimal choice. Layer on refinements (speculative decoding, KV cache tuning, SmoothQuant) only after validating this baseline works for your requirements.