Self-Hosting LLMs on Edge Hardware
A complete, battle-tested guide to running a production-ready LLM inference server on an NVIDIA Jetson AGX Thor — with team access control, cost tracking, and zero cloud dependency.
Introduction
Running large language models (LLMs) in-house gives your organisation complete control over your data, latency, and cost. This guide walks through deploying a production-grade, multi-user LLM inference stack on an NVIDIA Jetson AGX Thor — a powerful edge AI device with 128 GB of unified memory.
By the end of this guide you will have:
- A running Qwen3-VL-30B-A3B vision-language model (AWQ 4-bit quantised)
- An OpenAI-compatible API endpoint your team can point any client to
- Per-user API keys with rate limits and budget caps
- Token and spend tracking via a dashboard
- Everything running on local hardware — no cloud, no data leaving your network
Why Self-Host?
- Data privacy — prompts never leave your premises
- Cost predictability — no per-token cloud bills; your cost is electricity
- Latency — local inference is faster for sustained workloads
- Control — you choose the model, quantisation, and parameters
- Compliance — easier to satisfy data residency requirements
Architecture
The stack has three layers. Each layer has a single responsibility.
vLLM — The Model Server
vLLM loads the AI model weights into GPU memory and handles all inference. It exposes an
OpenAI-compatible HTTP API on port 8000 (internal only — never exposed outside Docker).
It manages GPU memory, batching, KV caching, and tokenisation. It has no authentication — that is
intentionally handled by LiteLLM.
LiteLLM — The Proxy Gateway
LiteLLM sits in front of vLLM and is the only component exposed to the outside world (port 12434).
It handles:
- API key validation: every request must carry a valid Bearer key
- Model name routing: maps a friendly name like Qwen3-VL-30B-A3B to the vLLM endpoint
- Per-key rate limiting, budget caps, and expiry
- Usage and spend logging to PostgreSQL
PostgreSQL — The State Store
PostgreSQL stores all persistent state: API keys, per-key spend, request logs, and user metadata. Without it, LiteLLM runs statelessly (no key management or spend tracking). With it, you get a full multi-user control plane.
Hardware — NVIDIA Jetson AGX Thor
The Jetson AGX Thor is NVIDIA's flagship edge AI module, built on the Thor SoC with a Blackwell-architecture GPU. Its 128 GB of unified memory makes it one of the few edge devices that can run a 30B-parameter model at usable speeds.
| Component | Specification |
|---|---|
| GPU Architecture | NVIDIA Blackwell (Thor SoC) |
| Unified Memory | 128 GB LPDDR5X (shared CPU + GPU) |
| Memory Bandwidth | ~273 GB/s |
| CPU | ARM64 (14-core Arm Neoverse-V3AE) |
| OS | JetPack / Ubuntu ARM64 |
| Docker Image | ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor |
Why 128 GB matters: The Qwen3-VL-30B-A3B model at AWQ 4-bit quantisation occupies roughly 15–18 GB. The remaining memory is available for the KV cache (context), which directly determines how long a conversation or document you can process in one shot.
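To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how many tokens of KV cache fit after the weights are loaded. The layer count, KV head count, and head dimension are assumptions for illustration (check the model's config.json), and the result is an upper bound: the vision encoder, activations, and the operating system all share the same unified memory.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: float) -> float:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(total_gb: float, utilization: float, weights_gb: float,
                      bytes_per_token: float) -> int:
    """Tokens that fit in the memory budget left after the weights are loaded."""
    budget_bytes = (total_gb * utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // bytes_per_token))


# ASSUMED architecture values, not taken from this guide:
# 48 layers, 4 KV heads, head dimension 128; an fp8 cache stores 1 byte per value.
per_token = kv_bytes_per_token(num_layers=48, num_kv_heads=4, head_dim=128,
                               bytes_per_value=1)
# 128 GB total, 85% utilisation, 18 GB of weights (upper end of the estimate above).
tokens = max_cached_tokens(total_gb=128, utilization=0.85, weights_gb=18,
                           bytes_per_token=per_token)
print(f"{per_token / 1024:.0f} KiB per token, roughly {tokens:,} cacheable tokens")
```

That token budget is shared by all concurrent requests, so it bounds the number of simultaneous conversations multiplied by their context length.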
Prerequisites
- NVIDIA Jetson AGX Thor with JetPack installed
- Docker installed (docker CLI; compose is not required)
- NVIDIA Container Runtime configured (--runtime nvidia must work)
- Model weights downloaded to ~/.cache/huggingface/ (cyankiwi/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit)
- The three Docker images pulled (or internet access to pull them):
  - ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
  - ghcr.io/berriai/litellm:main-latest
  - postgres:17
Docker Compose is not available on Jetson Thor: neither docker compose (with a space)
nor docker-compose (with a hyphen) exists. All commands in this guide use
plain docker run.
Step-by-Step Setup
Create a Docker Network
All three containers need to communicate with each other. Docker's bridge networking lets containers
reach each other by container name (e.g. ai-postgres) rather than IP address.
This is how LiteLLM finds vLLM and PostgreSQL without hardcoded IPs.
sudo docker network create infra-thor
Start PostgreSQL
PostgreSQL stores all API keys, spend data, and request logs. We mount a volume on the host so the database survives container restarts and upgrades.
sudo docker run -d \
--name ai-postgres \
--network infra-thor \
--restart unless-stopped \
-e POSTGRES_USER=litellm \
-e POSTGRES_PASSWORD=your-strong-db-password \
-e POSTGRES_DB=litellmdb \
-v /home/user/postgres_data:/var/lib/postgresql/data \
postgres:17
Verify it started:
sudo docker ps | grep ai-postgres
# Should show: Up X seconds
Start vLLM
This is the model inference server. On the Jetson Thor we use the NVIDIA-provided image
(ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor) which is compiled specifically for the
Thor SoC — the standard vllm/vllm-openai:latest image is x86-only and will not run on ARM64.
All model parameters are passed as CLI arguments (not a config file) because the image has a
conflicting directory at /app/config.yaml.
# Flag notes (a comment cannot follow a trailing backslash, so they are listed here):
#   --runtime nvidia               required for GPU access
#   --ipc host                     shared memory between processes
#   --served-model-name            friendly name clients use
#   --max-model-len 32768          max context window (tokens)
#   --tensor-parallel-size 1       single GPU
#   --gpu-memory-utilization 0.85  use 85% of GPU memory
#   --dtype bfloat16               natively supported on this GPU
#   --kv-cache-dtype fp8           fp8 KV cache
#   --max-num-seqs 32              max concurrent requests
#   --max-num-batched-tokens       max tokens in a batch
#   --enable-prefix-caching        cache common prefixes (system prompts)
sudo docker run -d \
--name vllm-qwen3-vl-30b-a3b \
--network infra-thor \
--runtime nvidia \
--ipc host \
--restart unless-stopped \
-e VLLM_NO_USAGE_STATS=1 \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
-v ~/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor \
python3 -m vllm.entrypoints.openai.api_server \
--model cyankiwi/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit \
--served-model-name Qwen3-VL-30B-A3B \
--max-model-len 32768 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.85 \
--dtype bfloat16 \
--kv-cache-dtype fp8 \
--max-num-seqs 32 \
--max-num-batched-tokens 32768 \
--limit-mm-per-prompt '{"image": 32, "video": 0}' \
--mm-processor-kwargs '{"max_pixels": 262144}' \
--compilation-config '{"compile_mm_encoder": true}' \
--enable-chunked-prefill \
--enable-prefix-caching \
--async-scheduling \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--generation-config vllm
Wait for the model to load
The model takes about 3 minutes to load from disk into GPU memory; the very first start can take 5 to 10 minutes while CUDA kernels compile. Watch the logs:
sudo docker logs -f vllm-qwen3-vl-30b-a3b
Wait until you see:
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
Then press Ctrl+C to stop following the logs. vLLM keeps running in the background.
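If you would rather script the wait than watch the logs, the check can be automated. The sketch below polls docker logs for the startup line shown above; wait_until and vllm_logs_show_ready are illustrative helper names, and the 15-minute timeout is an assumption rather than a measured value.

```python
import subprocess
import time
from typing import Callable

READY_MARKER = "Application startup complete."


def vllm_logs_show_ready(container: str = "vllm-qwen3-vl-30b-a3b") -> bool:
    """Return True once the container logs contain the startup marker."""
    result = subprocess.run(
        ["sudo", "docker", "logs", container],
        capture_output=True, text=True, check=False,
    )
    # docker logs replays the container's stdout and stderr as separate streams.
    return READY_MARKER in result.stdout + result.stderr


def wait_until(check: Callable[[], bool], timeout_s: float = 900.0,
               interval_s: float = 10.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_until(vllm_logs_show_ready) returns True as soon as the marker appears, or False if the timeout passes first.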
Create the LiteLLM Config File
LiteLLM reads a YAML config file that tells it which models exist and how to reach them. Create this file at a known absolute path on the host — we use this path in the next step.
Create /home/user/litellm/config.yaml with this content:
model_list:
  - model_name: Qwen3-VL-30B-A3B        # name clients call
    litellm_params:
      model: openai/Qwen3-VL-30B-A3B
      api_base: http://vllm-qwen3-vl-30b-a3b:8000/v1   # vLLM container name
      api_key: os.environ/LITELLM_MASTER_KEY
      input_cost_per_token: 0.0000002   # $0.20 per 1M input tokens
      output_cost_per_token: 0.000001   # $1.00 per 1M output tokens
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY   # passed via -e at runtime
  database_url: os.environ/DATABASE_URL       # passed via -e at runtime
  store_model_in_db: true
litellm_settings:
  request_timeout: 600
The input_cost_per_token and output_cost_per_token values are
virtual costs — the model is free to run locally, but LiteLLM uses these numbers to
calculate per-key spend. Set them to whatever internal chargeback rate makes sense for your team,
or set both to 0 to disable cost tracking entirely.
Start LiteLLM
Note that we mount the config file to /litellm-config.yaml (not /app/config.yaml)
to avoid a directory conflict inside the image. The --config flag tells LiteLLM where to find it.
# Port 12434 is published: this is the only container exposed externally
sudo docker run -d \
--name ai-litellm \
--network infra-thor \
-p 12434:12434 \
--restart unless-stopped \
-e LITELLM_MASTER_KEY=sk-your-master-key-here \
-e DATABASE_URL=postgresql://litellm:your-strong-db-password@ai-postgres:5432/litellmdb \
-v /home/user/litellm/config.yaml:/litellm-config.yaml:ro \
ghcr.io/berriai/litellm:main-latest \
--config /litellm-config.yaml --port 12434 --num_workers 4
Master key format: The key must start with sk-.
Generate a strong one with:
python3 -c "import secrets; print('sk-' + secrets.token_hex(16))"
Test the Stack
Verify all three containers are running:
sudo docker ps
# Expected: ai-postgres, vllm-qwen3-vl-30b-a3b, ai-litellm all showing "Up"
Send a test request:
curl http://localhost:12434/v1/chat/completions \
-H "Authorization: Bearer sk-your-master-key-here" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-VL-30B-A3B",
"messages": [{"role": "user", "content": "Hello!"}]
}'
You should receive a JSON response with the model's reply. The server is live.
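The same request can be sent from Python using only the standard library. This is a minimal sketch: build_chat_request and send are illustrative helper names, and the URL and key are the placeholders from the curl example above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:12434"      # LiteLLM port used in this guide
API_KEY = "sk-your-master-key-here"      # placeholder, replace with a real key


def build_chat_request(prompt: str, model: str = "Qwen3-VL-30B-A3B",
                       base_url: str = BASE_URL,
                       api_key: str = API_KEY) -> urllib.request.Request:
    """Build the POST request for the OpenAI-compatible chat endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 600.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

With the stack running, print(send(build_chat_request("Hello!"))) should print the model's reply. The 600-second timeout mirrors request_timeout in the LiteLLM config.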
Test with an image (vision capability):
curl http://localhost:12434/v1/chat/completions \
-H "Authorization: Bearer sk-your-master-key-here" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-VL-30B-A3B",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
}]
}'
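For images stored on disk rather than at a public URL, one common approach is to inline the file as a base64 data URL. The sketch below assumes the server accepts data URLs in the image_url field, as OpenAI-compatible vision endpoints generally do; image_data_url and build_vision_body are illustrative helper names.

```python
import base64


def image_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL usable in an image_url field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_body(question: str, image_bytes: bytes,
                      model: str = "Qwen3-VL-30B-A3B") -> dict:
    """Build a chat body with one text part and one inline image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": image_data_url(image_bytes)}},
            ],
        }],
    }
```

Read the file with pathlib.Path("photo.jpg").read_bytes() (a hypothetical file name), pass the bytes to build_vision_body, serialise the result with json.dumps, and POST it to /v1/chat/completions like the text request.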
Key Management
Your master key is the admin key — keep it private. Share individual generated keys with team members. Each person gets their own key with independent limits.
Generate a key for a team member
# Field notes (JSON does not allow inline comments, so they are listed here):
#   key_alias              identifier: who owns this key
#   models                 [] means access to all models
#   rpm_limit / tpm_limit  max requests / max tokens per minute
#   max_parallel_requests  max simultaneous requests
#   max_budget             key stops working after $5 virtual spend
#   budget_duration        reset budget every 30 days
#   duration               key expires in 30 days (null = never)
#   metadata               custom info for your records
curl -X POST http://localhost:12434/key/generate \
-H "Authorization: Bearer sk-your-master-key-here" \
-H "Content-Type: application/json" \
-d '{
"key_alias": "alice",
"models": ["Qwen3-VL-30B-A3B"],
"rpm_limit": 10,
"tpm_limit": 50000,
"max_parallel_requests": 2,
"max_budget": 5.0,
"budget_duration": "30d",
"duration": "30d",
"metadata": {
"team": "engineering",
"email": "alice@yourcompany.com"
}
}'
The response includes the generated key (e.g. sk-abc123...). Send that key to Alice — it's shown only once.
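When onboarding several people at once, it can help to generate the request bodies from a roster instead of editing JSON by hand. A minimal sketch, reusing the same fields as the request above; the roster entries and the key_request helper are hypothetical.

```python
import json


def key_request(alias: str, email: str, team: str,
                monthly_budget: float = 5.0) -> dict:
    """Build a /key/generate body with the limits used in this guide."""
    return {
        "key_alias": alias,
        "models": ["Qwen3-VL-30B-A3B"],
        "rpm_limit": 10,
        "tpm_limit": 50000,
        "max_parallel_requests": 2,
        "max_budget": monthly_budget,
        "budget_duration": "30d",
        "duration": "30d",
        "metadata": {"team": team, "email": email},
    }


# HYPOTHETICAL roster, for illustration only.
ROSTER = [
    {"alias": "alice", "email": "alice@yourcompany.com", "team": "engineering"},
    {"alias": "bob", "email": "bob@yourcompany.com", "team": "research",
     "monthly_budget": 10.0},
]

for person in ROSTER:
    # Each printed line is a valid JSON body for curl -d or an HTTP client.
    print(json.dumps(key_request(**person)))
```

Because the bodies come from json.dumps, they are always valid JSON.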
Update a key
curl -X POST http://localhost:12434/key/update \
-H "Authorization: Bearer sk-your-master-key-here" \
-H "Content-Type: application/json" \
-d '{"key": "sk-alice-key", "rpm_limit": 20}'
List all keys
curl http://localhost:12434/key/list \
-H "Authorization: Bearer sk-your-master-key-here"
Revoke a key
curl -X POST http://localhost:12434/key/delete \
-H "Authorization: Bearer sk-your-master-key-here" \
-H "Content-Type: application/json" \
-d '{"keys": ["sk-alice-key"]}'
Cost & Usage Tracking
Because PostgreSQL is running, LiteLLM logs every request with token counts and a calculated virtual cost. Even though you are hosting the model yourself, assigning a price per token lets you:
- Track which teams or individuals are the heaviest users
- Enforce per-key budget caps that automatically cut off a key when it is exceeded
- Do internal chargeback — bill departments based on actual consumption
- Understand the equivalent cloud cost you are saving by self-hosting
Why Set a Price for a Free Local Model?
The model costs nothing per token to run — your cost is electricity and hardware amortisation. But setting a reference price that mirrors what you would pay a cloud provider gives your spend numbers real meaning. It answers: "How much would this have cost us on the cloud?"
It also makes the budget cap on keys genuinely useful. If Alice's key has a
max_budget of $10 and you use cloud-equivalent pricing, you know she has used
roughly $10 worth of compute — a concrete limit regardless of token volume.
Where to Find Reference Prices for Open-Source Models
There is no single authoritative source. The strategy is to look at what commercial API providers charge for the same or similar model and use that as your reference rate. Below are the best sources, in order of usefulness:
1. OpenRouter — openrouter.ai/models
OpenRouter is the most comprehensive price directory for open-source models. Search for the model name and you will see per-token prices from multiple providers side by side. It lists hundreds of open-source models including Qwen, LLaMA, Mistral, Gemma, and more. Use the median price across providers as your reference.
2. Together AI — api.together.ai
Together AI hosts many open-source models. Their pricing page shows per-token rates for the exact model families you are likely running. Qwen and LLaMA variants are well represented.
3. Fireworks AI — fireworks.ai/pricing
Another reliable source for open-source model pricing. Particularly good for MoE (Mixture of Experts) models like the Qwen3-30B-A3B series, which are priced lower than dense models of similar quality.
4. Artificial Analysis — artificialanalysis.ai
Benchmarks quality, speed, and price across providers for the same model. Useful for cross-checking that your reference price is in a reasonable range.
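The "median across providers" suggestion for OpenRouter can be sketched in a few lines. The quotes below are made-up placeholder numbers, not real provider prices; substitute the figures you actually find.

```python
from statistics import median


def reference_rate(prices_per_million: list[float]) -> float:
    """Median of the collected per-1M-token prices, used as the reference rate."""
    if not prices_per_million:
        raise ValueError("need at least one provider price")
    return median(prices_per_million)


# HYPOTHETICAL quotes in USD per 1M input tokens, for illustration only.
quotes = [0.15, 0.20, 0.30]
print(f"reference input rate: ${reference_rate(quotes):.2f} per 1M tokens")
```

The median is less sensitive than the mean to a single provider with unusually high or low pricing.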
Reference prices for this guide's model
The following are approximate market rates as of mid-2025 for the Qwen3-VL-30B-A3B class of model. Use these as your starting point — verify against the sources above for the latest figures.
| Model tier | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| Qwen3-VL-30B-A3B (this guide) | ~$0.20 | ~$0.60 – $1.00 | MoE model, priced below equivalent dense 30B. Vision-language capable. |
| Qwen3-8B class | ~$0.05 – $0.10 | ~$0.10 – $0.20 | Smaller, faster, lower cost |
| LLaMA 3.1 70B class | ~$0.40 – $0.60 | ~$0.80 – $1.20 | Dense 70B, higher cost than MoE 30B |
| Embedding models | ~$0.01 – $0.02 | $0.00 | Output cost is always zero for embeddings |
Per-token math: divide the per-million price by 1,000,000 to get the per-token value
for the config file. For example, $0.20 per 1M input tokens =
0.20 / 1_000_000 = 0.0000002.
Output at $1.00 per 1M = 1.00 / 1_000_000 = 0.000001.
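The conversion above, and the resulting cost of a single request, can be sketched as follows. The token counts are hypothetical, and LiteLLM's own accounting may differ in details such as how image inputs are counted.

```python
def per_token_rate(usd_per_million: float) -> float:
    """Convert a price per 1M tokens into the per-token value for the config."""
    return usd_per_million / 1_000_000


def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Virtual cost of one request: input and output tokens priced separately."""
    return prompt_tokens * input_rate + completion_tokens * output_rate


INPUT_RATE = per_token_rate(0.20)    # value for input_cost_per_token
OUTPUT_RATE = per_token_rate(1.00)   # value for output_cost_per_token

# HYPOTHETICAL request: 1,200 prompt tokens and 300 completion tokens.
cost = request_cost(1200, 300, INPUT_RATE, OUTPUT_RATE)
print(f"virtual cost: ${cost:.6f}")
```

Dividing a key's max_budget by a typical per-request cost gives a rough idea of how many requests that key can make before it is cut off.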
Updating the Config to Track Costs
Open /home/user/litellm/config.yaml in a text editor and add (or update)
the two cost lines under your model's litellm_params:
model_list:
  - model_name: Qwen3-VL-30B-A3B
    litellm_params:
      model: openai/Qwen3-VL-30B-A3B
      api_base: http://vllm-qwen3-vl-30b-a3b:8000/v1
      api_key: os.environ/LITELLM_MASTER_KEY
      input_cost_per_token: 0.0000002   # $0.20 per 1M input tokens
      output_cost_per_token: 0.000001   # $1.00 per 1M output tokens
After saving the file, restart LiteLLM to apply the change:
sudo docker restart ai-litellm
Existing spend is not recalculated. Changing the price only affects requests made after the restart. Historical logs keep the cost they were recorded with. If you want a clean slate, truncate the spend tables in PostgreSQL before restarting.
Viewing Spend & Usage
Web Dashboard (recommended)
Open in any browser on your network:
http://<thor-device-ip>:12434/ui
Log in with your master key. The dashboard shows: spend per key over time, total token consumption, request counts, model breakdown, and a live request log. This is the easiest way to share usage data with managers or finance.
API — Total spend across all keys
curl http://localhost:12434/global/spend \
-H "Authorization: Bearer sk-your-master-key-here"
API — Spend broken down by key
curl http://localhost:12434/global/spend/keys \
-H "Authorization: Bearer sk-your-master-key-here"
API — Info and cumulative spend for one specific key
curl "http://localhost:12434/key/info?key=sk-alice-key" \
-H "Authorization: Bearer sk-your-master-key-here"
API — Detailed per-request logs
curl http://localhost:12434/spend/logs \
-H "Authorization: Bearer sk-your-master-key-here"
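Once you have the per-request logs as parsed JSON, a small aggregation gives a per-key summary. This sketch assumes each log row is a dict with api_key, spend, and total_tokens fields; those names are an assumption here, so check them against the actual response from your LiteLLM version.

```python
from collections import defaultdict


def spend_by_key(log_rows: list[dict]) -> dict[str, dict[str, float]]:
    """Sum spend, tokens, and request count per API key."""
    totals = defaultdict(lambda: {"spend": 0.0, "total_tokens": 0, "requests": 0})
    for row in log_rows:
        entry = totals[row.get("api_key", "unknown")]
        # Missing or null fields are treated as zero.
        entry["spend"] += float(row.get("spend") or 0.0)
        entry["total_tokens"] += int(row.get("total_tokens") or 0)
        entry["requests"] += 1
    return dict(totals)


# HYPOTHETICAL rows in the assumed shape, for illustration only.
sample_rows = [
    {"api_key": "key-a", "spend": 0.5, "total_tokens": 100},
    {"api_key": "key-a", "spend": 0.25, "total_tokens": 50},
    {"api_key": "key-b", "spend": 1.0, "total_tokens": 10},
]
for key, summary in spend_by_key(sample_rows).items():
    print(key, summary)
```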
Daily Operations
Check container status
sudo docker ps
View logs
# vLLM (model loading, inference errors)
sudo docker logs -f vllm-qwen3-vl-30b-a3b
# LiteLLM (API requests, auth errors)
sudo docker logs -f ai-litellm
# PostgreSQL
sudo docker logs -f ai-postgres
Stop everything
sudo docker stop ai-litellm vllm-qwen3-vl-30b-a3b ai-postgres
Start everything (after a stop or reboot)
All containers are started with --restart unless-stopped, so they automatically restart
after a device reboot. If you manually stopped them:
# Start in this order — postgres first, vLLM second, LiteLLM last
sudo docker start ai-postgres
sudo docker start vllm-qwen3-vl-30b-a3b
sudo docker start ai-litellm
Restart LiteLLM after config change
sudo docker restart ai-litellm
Update the LiteLLM image
sudo docker pull ghcr.io/berriai/litellm:main-latest
sudo docker stop ai-litellm
sudo docker rm ai-litellm
# Then re-run the docker run command from the Start LiteLLM section
Troubleshooting
IsADirectoryError: /app/config.yaml
The NVIDIA Jetson vLLM image has a directory at /app/config.yaml, which conflicts with
Docker file mounts. Solution: pass all model arguments as CLI flags (as shown in the
Start vLLM section) instead of mounting a config file. For LiteLLM, mount to
/litellm-config.yaml instead of /app/config.yaml.
Container name already in use
sudo docker rm -f <container-name>
# Then re-run the docker run command
Network not found
sudo docker network create infra-thor
# Then re-run the docker run command
LiteLLM returns 401 Unauthorized
The key is wrong or was not passed correctly. Verify:
- The Authorization header is exactly Bearer sk-your-key (with the space)
- The key exists (check via /key/list)
- The key has not expired and has not exceeded its budget
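The first check in the list above is easy to get wrong when a client builds the header itself. The sketch below flags the most common formatting mistakes. It enforces this guide's conventions (the exact word Bearer, one space, a key starting with sk-) and is not LiteLLM's own validation logic; auth_header_problems is an illustrative name.

```python
def auth_header_problems(header_value: str) -> list[str]:
    """Return a list of formatting problems; an empty list means it looks fine."""
    problems = []
    scheme, separator, key = header_value.partition(" ")
    if scheme != "Bearer":
        problems.append('header must start with the word "Bearer"')
    if not separator:
        problems.append("missing the space between Bearer and the key")
    elif not key or key != key.strip():
        problems.append("key is empty or surrounded by extra whitespace")
    if key.strip() and not key.strip().startswith("sk-"):
        problems.append('key does not start with "sk-"')
    return problems


for candidate in ["Bearer sk-your-key", "Bearersk-your-key", "Bearer  sk-your-key"]:
    print(repr(candidate), auth_header_problems(candidate))
```

A well-formed header only rules out formatting errors. A key that is expired, revoked, or over budget is still rejected by the server.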
curl: Failed to connect to port 12434
LiteLLM is not running or crashed. Check:
sudo docker ps -a | grep ai-litellm
# If status is "Exited", check why:
sudo docker logs ai-litellm
vLLM is slow to start
The first startup compiles CUDA kernels. This can take 5–10 minutes on the first run. Subsequent starts are faster because the compilation cache is reused.
Out of memory on vLLM
Reduce --gpu-memory-utilization from 0.85 to 0.75,
or reduce --max-model-len from 32768 to 16384.
Quick Reference
Stack summary
| Component | Container name | Port | Image |
|---|---|---|---|
| PostgreSQL | ai-postgres | 5432 (internal) | postgres:17 |
| vLLM | vllm-qwen3-vl-30b-a3b | 8000 (internal) | ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor |
| LiteLLM | ai-litellm | 12434 (external) | ghcr.io/berriai/litellm:main-latest |
Model parameters explained
| Parameter | Value | Why |
|---|---|---|
| --dtype | bfloat16 | Natively supported by the Thor GPU |
| --kv-cache-dtype | fp8 | Halves KV cache memory versus a 16-bit cache |
| --gpu-memory-utilization | 0.85 | Leaves 15% headroom for system |
| --max-model-len | 32768 | 32K token context window |
| --max-num-seqs | 32 | Max concurrent users being served |
| --enable-prefix-caching | — | Reuses KV cache for repeated system prompts |
| --enable-chunked-prefill | — | Better latency for long prompts |
| --tool-call-parser | qwen3_xml | Enables function calling for Qwen3 models |
Key duration shortcuts
| Value | Meaning |
|---|---|
| "1h" | 1 hour |
| "24h" | 1 day |
| "7d" | 1 week |
| "30d" | 1 month |
| "90d" | 3 months |
| null | Never expires |
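To work out when a key will expire, the shortcuts in the table can be turned into concrete time spans. This sketch handles only the hour and day forms listed above, plus null. It illustrates the notation and is not LiteLLM's parser.

```python
import re
from datetime import timedelta
from typing import Optional

_UNITS = {"h": "hours", "d": "days"}


def parse_duration(value: Optional[str]) -> Optional[timedelta]:
    """Convert "24h" or "30d" into a timedelta; None means the key never expires."""
    if value is None:
        return None
    match = re.fullmatch(r"(\d+)([hd])", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


for shortcut in ["1h", "24h", "7d", "30d", "90d", None]:
    print(shortcut, "->", parse_duration(shortcut))
```

Adding the result to the key's creation time gives the expiry timestamp. Note that "30d" is exactly 30 days, which only approximates a calendar month.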