Go-To Guide

Self-Hosting LLMs on Edge Hardware

A complete, battle-tested guide to running a production-ready LLM inference server on an NVIDIA Jetson AGX Thor — with team access control, cost tracking, and zero cloud dependency.

Jetson AGX Thor

vLLM + LiteLLM + PostgreSQL

Qwen3-VL-30B-A3B

Docker

Introduction

Running large language models (LLMs) in-house gives your organisation complete control over your data, latency, and cost. This guide walks through deploying a production-grade, multi-user LLM inference stack on an NVIDIA Jetson AGX Thor — a powerful edge AI device with 128 GB of unified memory.

By the end of this guide you will have:

A running Qwen3-VL-30B-A3B vision-language model (AWQ 4-bit quantised)
An OpenAI-compatible API endpoint your team can point any client to
Per-user API keys with rate limits and budget caps
Token and spend tracking via a dashboard
Everything running on local hardware — no cloud, no data leaving your network

Why Self-Host?

Data privacy — prompts never leave your premises
Cost predictability — no per-token cloud bills; your cost is electricity
Latency — requests stay on your local network, with no internet round trip or shared-tenant queueing
Control — you choose the model, quantisation, and parameters
Compliance — easier to satisfy data residency requirements

Architecture

The stack has three layers. Each layer has a single responsibility.

Your app / curl (any OpenAI-compatible client: curl, Python, any OpenAI SDK)

↓ port 12434

LiteLLM Proxy (auth · routing · usage tracking) ↔ PostgreSQL on port 5432, internal (keys · spend · logs)

↓ port 8000 (internal Docker network)

vLLM Server (model inference · GPU · KV cache)

vLLM — The Model Server

vLLM loads the AI model weights into GPU memory and handles all inference. It exposes an OpenAI-compatible HTTP API on port 8000 (internal only — never exposed outside Docker). It manages GPU memory, batching, KV caching, and tokenisation. It has no authentication — that is intentionally handled by LiteLLM.

LiteLLM — The Proxy Gateway

LiteLLM sits in front of vLLM and is the only component exposed to the outside world (port 12434). It handles:

API key validation — every request must carry a valid Bearer key
Model name routing — maps a friendly name like Qwen3-VL-30B-A3B to the vLLM endpoint
Per-key rate limiting, budget caps, and expiry
Usage and spend logging to PostgreSQL

PostgreSQL — The State Store

PostgreSQL stores all persistent state: API keys, per-key spend, request logs, and user metadata. Without it, LiteLLM runs statelessly (no key management or spend tracking). With it, you get a full multi-user control plane.

Hardware — NVIDIA Jetson AGX Thor

The Jetson AGX Thor is NVIDIA's flagship edge AI module, built on the Thor SoC with a Blackwell-architecture GPU. Its 128 GB of unified memory makes it one of the few edge devices that can hold a 30B-parameter model and a large KV cache at the same time.

Component	Specification
GPU Architecture	NVIDIA Blackwell (Thor SoC)
Unified Memory	128 GB LPDDR5X (shared CPU + GPU)
Memory Bandwidth	~273 GB/s
CPU	ARM64 (14-core Arm Neoverse-V3AE)
OS	JetPack / Ubuntu ARM64
Docker Image	`ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor`

💡

Why 128 GB matters: The Qwen3-VL-30B-A3B model at AWQ 4-bit quantisation occupies roughly 15–18 GB. The remaining memory is available for the KV cache (context), which directly determines how long a conversation or document you can process in one shot.
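To make that memory arithmetic concrete, here is a rough back-of-envelope sketch. The layer count, KV-head count, and head dimension are assumptions about the model's text decoder (check config.json in the downloaded weights), and the function names are illustrative. The 0.85 utilisation and 1-byte fp8 cache values match the vLLM flags used later in this guide. The estimate ignores activations, the vision encoder, and the operating system's share of unified memory, so treat the result as an upper bound rather than a prediction.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache for one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(total_gb: float, utilization: float, weights_gb: float,
                      per_token_bytes: int) -> int:
    """Upper bound on tokens that fit in the memory left after the weights."""
    budget_bytes = (total_gb * utilization - weights_gb) * 1e9
    return int(budget_bytes // per_token_bytes)


# ASSUMED architecture values; verify against the model's config.json.
per_token = kv_bytes_per_token(num_layers=48, num_kv_heads=4, head_dim=128,
                               bytes_per_value=1)  # fp8 stores 1 byte per value
tokens = max_cached_tokens(total_gb=128, utilization=0.85, weights_gb=18,
                           per_token_bytes=per_token)
print(f"{per_token} bytes per token, roughly {tokens:,} cacheable tokens")
```

The cache is shared across all concurrent requests, so the per-request context limit is still governed by --max-model-len.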

Prerequisites

NVIDIA Jetson AGX Thor with JetPack installed
Docker installed (docker CLI — compose is not required)
NVIDIA Container Runtime configured (--runtime nvidia must work)
Model weights downloaded to ~/.cache/huggingface/ (cyankiwi/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit)
The three Docker images pulled (or internet access to pull them):
- ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
- ghcr.io/berriai/litellm:main-latest
- postgres:17

⚠️

This guide does not use Docker Compose. On the Jetson Thor setup described here, neither docker compose (the plugin, with a space) nor docker-compose (the standalone binary, with a hyphen) was available, so every command uses plain docker run.

Step-by-Step Setup

Create a Docker Network

All three containers need to communicate with each other. A user-defined Docker bridge network lets containers reach each other by container name (e.g. ai-postgres) rather than IP address; the default bridge network does not provide this name resolution. This is how LiteLLM finds vLLM and PostgreSQL without hardcoded IPs.

sudo docker network create infra-thor

Start PostgreSQL

PostgreSQL stores all API keys, spend data, and request logs. We mount a volume on the host so the database survives container restarts and upgrades.

sudo docker run -d \
  --name ai-postgres \
  --network infra-thor \
  --restart unless-stopped \
  -e POSTGRES_USER=litellm \
  -e POSTGRES_PASSWORD=your-strong-db-password \
  -e POSTGRES_DB=litellmdb \
  -v /home/user/postgres_data:/var/lib/postgresql/data \
  postgres:17

Verify it started:

sudo docker ps | grep ai-postgres
# Should show: Up X seconds

Start vLLM

This is the model inference server. On the Jetson Thor we use the NVIDIA-provided image (ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor) which is compiled specifically for the Thor SoC — the standard vllm/vllm-openai:latest image is x86-only and will not run on ARM64.

All model parameters are passed as CLI arguments (not a config file) because the image has a conflicting directory at /app/config.yaml.

# Flag notes:
#   --runtime nvidia            required for GPU access
#   --ipc host                  shared memory between processes
#   --served-model-name         friendly name clients use
#   --max-model-len             max context window (tokens)
#   --tensor-parallel-size 1    single GPU
#   --gpu-memory-utilization    let vLLM use 85% of GPU memory
#   --dtype bfloat16            natively supported by the Thor GPU
#   --kv-cache-dtype fp8        8-bit KV cache, half the size of a 16-bit cache
#   --max-num-seqs              max concurrent requests
#   --max-num-batched-tokens    max tokens in a batch
#   --enable-prefix-caching     cache common prefixes (system prompts)
sudo docker run -d \
  --name vllm-qwen3-vl-30b-a3b \
  --network infra-thor \
  --runtime nvidia \
  --ipc host \
  --restart unless-stopped \
  -e VLLM_NO_USAGE_STATS=1 \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor \
  python3 -m vllm.entrypoints.openai.api_server \
    --model cyankiwi/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit \
    --served-model-name Qwen3-VL-30B-A3B \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.85 \
    --dtype bfloat16 \
    --kv-cache-dtype fp8 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 32768 \
    --limit-mm-per-prompt '{"image": 32, "video": 0}' \
    --mm-processor-kwargs '{"max_pixels": 262144}' \
    --compilation-config '{"compile_mm_encoder": true}' \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --async-scheduling \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --generation-config vllm

Wait for the model to load

The model takes ~3 minutes to load from disk into GPU memory; the very first start can take longer while CUDA kernels are compiled and cached. Watch the logs:

sudo docker logs -f vllm-qwen3-vl-30b-a3b

Wait until you see:

INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000

Then press Ctrl+C to stop following the logs. vLLM keeps running in the background.

Create the LiteLLM Config File

LiteLLM reads a YAML config file that tells it which models exist and how to reach them. Create this file at a known absolute path on the host — we use this path in the next step.

Create /home/user/litellm/config.yaml with this content:

model_list:
  - model_name: Qwen3-VL-30B-A3B         # name clients call
    litellm_params:
      model: openai/Qwen3-VL-30B-A3B
      api_base: http://vllm-qwen3-vl-30b-a3b:8000/v1  # vLLM container name
      api_key: os.environ/LITELLM_MASTER_KEY
      input_cost_per_token: 0.0000002    # $0.20 per 1M input tokens
      output_cost_per_token: 0.000001    # $1.00 per 1M output tokens

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY      # passed via -e at runtime
  database_url: os.environ/DATABASE_URL           # passed via -e at runtime
  store_model_in_db: true

litellm_settings:
  request_timeout: 600

💡

The input_cost_per_token and output_cost_per_token values are virtual costs — the model is free to run locally, but LiteLLM uses these numbers to calculate per-key spend. Set them to whatever internal chargeback rate makes sense for your team, or set both to 0 to disable cost tracking entirely.

Start LiteLLM

Note that we mount the config file to /litellm-config.yaml (not /app/config.yaml) to avoid a directory conflict inside the image. The --config flag tells LiteLLM where to find it.

# -p 12434:12434 publishes the proxy port; this is the only container exposed externally
sudo docker run -d \
  --name ai-litellm \
  --network infra-thor \
  -p 12434:12434 \
  --restart unless-stopped \
  -e LITELLM_MASTER_KEY=sk-your-master-key-here \
  -e DATABASE_URL=postgresql://litellm:your-strong-db-password@ai-postgres:5432/litellmdb \
  -v /home/user/litellm/config.yaml:/litellm-config.yaml:ro \
  ghcr.io/berriai/litellm:main-latest \
  --config /litellm-config.yaml --port 12434 --num_workers 4

⚠️

Master key format: The key must start with sk-. Generate a strong one with:
python3 -c "import secrets; print('sk-' + secrets.token_hex(16))"

Test the Stack

Verify all three containers are running:

sudo docker ps
# Expected: ai-postgres, vllm-qwen3-vl-30b-a3b, ai-litellm all showing "Up"

Send a test request:

curl http://localhost:12434/v1/chat/completions \
  -H "Authorization: Bearer sk-your-master-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-VL-30B-A3B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

You should receive a JSON response with the model's reply. The server is live.
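The same request can be sent from Python using only the standard library. This is a minimal sketch, not an official client: the helper names build_chat_request and chat are hypothetical, and the URL and key are the same placeholders used in the curl example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:12434"      # LiteLLM port used in this guide
API_KEY = "sk-your-master-key-here"      # placeholder; substitute a real key


def build_chat_request(prompt: str,
                       model: str = "Qwen3-VL-30B-A3B") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=600) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With the stack running, chat("Hello!") returns the model's reply as a string. The 600-second timeout mirrors the request_timeout set in the LiteLLM config.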

Test with an image (vision capability):

curl http://localhost:12434/v1/chat/completions \
  -H "Authorization: Bearer sk-your-master-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-VL-30B-A3B",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]
    }]
  }'
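When the image is a local file rather than a public URL, OpenAI-compatible servers generally accept it inline as a base64 data URL in the same image_url field. The sketch below builds such a message; the helper name image_message is hypothetical, and it assumes the file extension correctly identifies the image type.

```python
import base64
import mimetypes


def image_message(question: str, image_bytes: bytes,
                  filename: str = "image.jpg") -> dict:
    """Return a user message with the image embedded as a base64 data URL."""
    # Guess the MIME type from the file name; fall back to a generic type.
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

Read the file with open(path, "rb").read(), pass the bytes to image_message, and place the result in the request's messages list. Large images inflate the request body by about a third once base64-encoded, and the max_pixels setting in the vLLM command still bounds how much of the image the model processes.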

Key Management

Your master key is the admin key — keep it private. Share individual generated keys with team members. Each person gets their own key with independent limits.

Generate a key for a team member

# Field notes:
#   key_alias              identifier: who owns this key
#   models                 allowed models; [] means access to all models
#   rpm_limit / tpm_limit  max requests / tokens per minute
#   max_parallel_requests  max simultaneous requests
#   max_budget             key stops working after $5 of virtual spend
#   budget_duration        budget resets every 30 days
#   duration               key expires in 30 days (null = never)
#   metadata               custom info for your records
curl -X POST http://localhost:12434/key/generate \
  -H "Authorization: Bearer sk-your-master-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "alice",
    "models": ["Qwen3-VL-30B-A3B"],
    "rpm_limit": 10,
    "tpm_limit": 50000,
    "max_parallel_requests": 2,
    "max_budget": 5.0,
    "budget_duration": "30d",
    "duration": "30d",
    "metadata": {
      "team": "engineering",
      "email": "alice@yourcompany.com"
    }
  }'

The response includes the generated key (e.g. sk-abc123...). Send that key to Alice — it's shown only once.
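If you onboard several people, building the request in a script avoids hand-editing JSON each time. The following is a sketch under the same limits as the curl example; key_request is a hypothetical helper name, and the master key and base URL are passed in rather than hardcoded.

```python
import json
import urllib.request


def key_request(alias: str, email: str, team: str, master_key: str,
                base_url: str = "http://localhost:12434") -> urllib.request.Request:
    """Build a /key/generate request with the limits used in this guide."""
    body = {
        "key_alias": alias,
        "models": ["Qwen3-VL-30B-A3B"],
        "rpm_limit": 10,
        "tpm_limit": 50000,
        "max_parallel_requests": 2,
        "max_budget": 5.0,
        "budget_duration": "30d",
        "duration": "30d",
        "metadata": {"team": team, "email": email},
    }
    return urllib.request.Request(
        f"{base_url}/key/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Send the request with urllib.request.urlopen(req) and parse the JSON body of the response to retrieve the new key.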

Update a key

curl -X POST http://localhost:12434/key/update \
  -H "Authorization: Bearer sk-your-master-key-here" \
  -H "Content-Type: application/json" \
  -d '{"key": "sk-alice-key", "rpm_limit": 20}'

List all keys

curl http://localhost:12434/key/list \
  -H "Authorization: Bearer sk-your-master-key-here"

Revoke a key

curl -X POST http://localhost:12434/key/delete \
  -H "Authorization: Bearer sk-your-master-key-here" \
  -H "Content-Type: application/json" \
  -d '{"keys": ["sk-alice-key"]}'

Cost & Usage Tracking

Because PostgreSQL is running, LiteLLM logs every request with token counts and a calculated virtual cost. Even though you are hosting the model yourself, assigning a price per token lets you:

Track which teams or individuals are the heaviest users
Enforce per-key budget caps that automatically cut off a key when it is exceeded
Do internal chargeback — bill departments based on actual consumption
Understand the equivalent cloud cost you are saving by self-hosting

Why Set a Price for a Free Local Model?

The model costs nothing per token to run — your cost is electricity and hardware amortisation. But setting a reference price that mirrors what you would pay a cloud provider gives your spend numbers real meaning. It answers: "How much would this have cost us on the cloud?"

It also makes the budget cap on keys genuinely useful. If Alice's key has a max_budget of $10 and you use cloud-equivalent pricing, you know she has used roughly $10 worth of compute — a concrete limit regardless of token volume.

Where to Find Reference Prices for Open-Source Models

There is no single authoritative source. The strategy is to look at what commercial API providers charge for the same or similar model and use that as your reference rate. Below are the best sources, in order of usefulness:

1. OpenRouter — openrouter.ai/models

OpenRouter is the most comprehensive price directory for open-source models. Search for the model name and you will see per-token prices from multiple providers side by side. It lists hundreds of open-source models including Qwen, LLaMA, Mistral, Gemma, and more. Use the median price across providers as your reference.

2. Together AI — api.together.ai

Together AI hosts many open-source models. Their pricing page shows per-token rates for the exact model families you are likely running. Qwen and LLaMA variants are well represented.

3. Fireworks AI — fireworks.ai/pricing

Another reliable source for open-source model pricing. Particularly good for MoE (Mixture of Experts) models like the Qwen3-30B-A3B series, which are priced lower than dense models of similar quality.

4. Artificial Analysis — artificialanalysis.ai

Benchmarks quality, speed, and price across providers for the same model. Useful for cross-checking that your reference price is in a reasonable range.
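The "median price across providers" suggestion from the OpenRouter entry is a one-line calculation. The prices below are hypothetical placeholders, not quotes from any provider; substitute the figures you actually find.

```python
from statistics import median

# HYPOTHETICAL per-1M-token input prices (USD) from four providers.
input_prices_per_million = [0.15, 0.18, 0.22, 0.30]

# The median resists a single unusually cheap or expensive provider.
reference = median(input_prices_per_million)
print(f"Reference input price: ${reference:.2f} per 1M tokens")
```

Repeat the calculation separately for output prices, since providers usually charge more for output tokens than for input tokens.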

Reference prices for this guide's model

The following are approximate market rates as of mid-2025 for the Qwen3-VL-30B-A3B class of model. Use these as your starting point — verify against the sources above for the latest figures.

Model tier	Input (per 1M tokens)	Output (per 1M tokens)	Notes
Qwen3-VL-30B-A3B (this guide)	~$0.20	~$0.60 – $1.00	MoE model, priced below equivalent dense 30B. Vision-language capable.
Qwen3-8B class	~$0.05 – $0.10	~$0.10 – $0.20	Smaller, faster, lower cost
LLaMA 3.1 70B class	~$0.40 – $0.60	~$0.80 – $1.20	Dense 70B, higher cost than MoE 30B
Embedding models	~$0.01 – $0.02	$0.00	Output cost is always zero for embeddings

💡

Per-token math: divide the per-million price by 1,000,000 to get the per-token value for the config file. For example, $0.20 per 1M input tokens = 0.20 / 1_000_000 = 0.0000002. Output at $1.00 per 1M = 1.00 / 1_000_000 = 0.000001.
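The conversion above, plus the virtual cost of a single request, can be sketched as follows. The function names are illustrative, and the 2,000-input / 500-output token request is an invented example.

```python
def per_token(price_per_million: float) -> float:
    """Convert a per-1M-token price into the per-token value for config.yaml."""
    return price_per_million / 1_000_000


def request_cost(input_tokens: int, output_tokens: int,
                 input_per_million: float, output_per_million: float) -> float:
    """Virtual cost of one request at the given per-1M-token prices."""
    return (input_tokens * per_token(input_per_million)
            + output_tokens * per_token(output_per_million))


# Example request: 2,000 input tokens and 500 output tokens at this guide's rates.
cost = request_cost(2_000, 500, input_per_million=0.20, output_per_million=1.00)
print(f"input_cost_per_token:  {per_token(0.20):.7f}")
print(f"output_cost_per_token: {per_token(1.00):.6f}")
print(f"example request cost:  ${cost:.4f}")
```

At these rates the example request costs about $0.0009, so a $5 budget covers several thousand requests of that size.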

Updating the Config to Track Costs

Open /home/user/litellm/config.yaml in a text editor and add (or update) the two cost lines under your model's litellm_params:

model_list:
  - model_name: Qwen3-VL-30B-A3B
    litellm_params:
      model: openai/Qwen3-VL-30B-A3B
      api_base: http://vllm-qwen3-vl-30b-a3b:8000/v1
      api_key: os.environ/LITELLM_MASTER_KEY
      input_cost_per_token: 0.0000002   # $0.20 per 1M input tokens
      output_cost_per_token: 0.000001    # $1.00 per 1M output tokens

After saving the file, restart LiteLLM to apply the change:

sudo docker restart ai-litellm

⚠️

Existing spend is not recalculated. Changing the price only affects requests made after the restart. Historical logs keep the cost they were recorded with. If you want a clean slate, truncate the spend tables in PostgreSQL before restarting.

Viewing Spend & Usage

Web Dashboard (recommended)

Open in any browser on your network:

http://<thor-device-ip>:12434/ui

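Log in with your master key (by default LiteLLM expects the username admin with the master key as the password, unless UI_USERNAME and UI_PASSWORD are set). The dashboard shows: spend per key over time, total token consumption, request counts, model breakdown, and a live request log. This is the easiest way to share usage data with managers or finance.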

API — Total spend across all keys

curl http://localhost:12434/global/spend \
  -H "Authorization: Bearer sk-your-master-key-here"

API — Spend broken down by key

curl http://localhost:12434/global/spend/keys \
  -H "Authorization: Bearer sk-your-master-key-here"

API — Info and cumulative spend for one specific key

curl "http://localhost:12434/key/info?key=sk-alice-key" \
  -H "Authorization: Bearer sk-your-master-key-here"

API — Detailed per-request logs

curl http://localhost:12434/spend/logs \
  -H "Authorization: Bearer sk-your-master-key-here"
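The per-request logs can be rolled up into totals per key with a few lines of Python. This sketch assumes the endpoint returns a JSON list of objects that each carry api_key and spend fields; confirm the field names against the actual output of your LiteLLM version before relying on it. The sample rows are invented.

```python
from collections import defaultdict


def spend_by_key(logs: list[dict]) -> dict[str, float]:
    """Sum the 'spend' field of each log entry, grouped by 'api_key'."""
    totals: dict[str, float] = defaultdict(float)
    for entry in logs:
        key = entry.get("api_key", "unknown")
        # Treat a missing or null spend value as zero.
        totals[key] += float(entry.get("spend") or 0.0)
    return dict(totals)


# INVENTED sample rows standing in for the parsed JSON response.
sample_logs = [
    {"api_key": "key-a", "spend": 0.5},
    {"api_key": "key-a", "spend": 0.25},
    {"api_key": "key-b", "spend": None},
]
print(spend_by_key(sample_logs))
```

The key identifiers in real logs may be hashed values rather than the sk- strings you handed out, so match them to people via the key list or the key alias.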

Daily Operations

Check container status

sudo docker ps

View logs

# vLLM (model loading, inference errors)
sudo docker logs -f vllm-qwen3-vl-30b-a3b

# LiteLLM (API requests, auth errors)
sudo docker logs -f ai-litellm

# PostgreSQL
sudo docker logs -f ai-postgres

Stop everything

sudo docker stop ai-litellm vllm-qwen3-vl-30b-a3b ai-postgres

Start everything (after a stop or reboot)

All containers are started with --restart unless-stopped, so they come back automatically after a device reboot unless you stopped them manually. If you did stop them, start them again in dependency order:

# Start in this order — postgres first, vLLM second, LiteLLM last
sudo docker start ai-postgres
sudo docker start vllm-qwen3-vl-30b-a3b
sudo docker start ai-litellm

Restart LiteLLM after config change

sudo docker restart ai-litellm

Update the LiteLLM image

sudo docker pull ghcr.io/berriai/litellm:main-latest
sudo docker stop ai-litellm
sudo docker rm ai-litellm
# Then re-run the docker run command from the "Start LiteLLM" step

Troubleshooting

IsADirectoryError: /app/config.yaml

The NVIDIA Jetson vLLM image has a directory at /app/config.yaml which conflicts with Docker file mounts. Solution: pass all model arguments as CLI flags (as shown in the "Start vLLM" step) instead of mounting a config file. For LiteLLM, mount to /litellm-config.yaml instead of /app/config.yaml.

Container name already in use

sudo docker rm -f <container-name>
# Then re-run the docker run command

Network not found

sudo docker network create infra-thor
# Then re-run the docker run command

LiteLLM returns 401 Unauthorized

The key is wrong or was not passed correctly. Verify:

The Authorization header is exactly Bearer sk-your-key (with the space)
The key exists — check via /key/list
The key has not expired and has not exceeded its budget
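A malformed header is the most common of these causes and is easy to check in client code before a request is sent. The sketch below follows this guide's convention that keys start with sk-; the function name is hypothetical, and the check says nothing about whether the key actually exists on the server.

```python
import re

# "Bearer", exactly one space, then a key starting with "sk-" and no whitespace.
_BEARER_PATTERN = re.compile(r"Bearer sk-\S+")


def looks_like_valid_auth_header(value: str) -> bool:
    """Return True if the header value has the shape 'Bearer sk-...'."""
    return _BEARER_PATTERN.fullmatch(value) is not None
```

Typical failures this catches are a lowercase "bearer", a doubled space, a missing prefix, or a trailing newline picked up when the key is read from a file or shell variable.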

curl: Failed to connect to port 12434

LiteLLM is not running or crashed. Check:

sudo docker ps -a | grep ai-litellm
# If status is "Exited", check why:
sudo docker logs ai-litellm

vLLM is slow to start

The first startup compiles CUDA kernels. This can take 5–10 minutes on the first run. Subsequent starts are faster because the compilation cache is reused.

Out of memory on vLLM

Reduce --gpu-memory-utilization from 0.85 to 0.75, or reduce --max-model-len from 32768 to 16384.

Quick Reference

Stack summary

Component	Container name	Port	Image
PostgreSQL	ai-postgres	5432 (internal)	postgres:17
vLLM	vllm-qwen3-vl-30b-a3b	8000 (internal)	ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
LiteLLM	ai-litellm	12434 (external)	ghcr.io/berriai/litellm:main-latest

Model parameters explained

Parameter	Value	Why
--dtype	bfloat16	Natively supported precision on the Thor GPU
--kv-cache-dtype	fp8	Halves KV cache memory compared with a 16-bit cache
--gpu-memory-utilization	0.85	Leaves 15% headroom for system
--max-model-len	32768	32K token context window
--max-num-seqs	32	Max requests (sequences) processed concurrently
--enable-prefix-caching	—	Reuses KV cache for repeated system prompts
--enable-chunked-prefill	—	Better latency for long prompts
--tool-call-parser	qwen3_xml	Enables function calling for Qwen3 models

Key duration shortcuts

Value	Meaning
"1h"	1 hour
"24h"	1 day
"7d"	1 week
"30d"	1 month
"90d"	3 months
null	Never expires
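To compute when a key will expire on the client side, the shortcuts in this table can be turned into time spans. This is an illustrative parser covering only the forms listed above (hours, days, and null); it is not LiteLLM's own parsing code, and parse_duration is a hypothetical name.

```python
from datetime import timedelta
from typing import Optional

# Only the units that appear in the table above.
_UNITS = {"h": "hours", "d": "days"}


def parse_duration(value: Optional[str]) -> Optional[timedelta]:
    """Convert '24h' or '30d' to a timedelta; None means the key never expires."""
    if value is None:
        return None
    number, unit = value[:-1], value[-1:]
    if unit not in _UNITS or not number.isdigit():
        raise ValueError(f"unsupported duration: {value!r}")
    return timedelta(**{_UNITS[unit]: int(number)})
```

Adding the result to the key's creation time gives its expiry, for example datetime.now() + parse_duration("30d").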