llama.cpp Router Mode On Proxmox
Run multiple llama.cpp models behind one Proxmox-hosted router, with preset files, VRAM planning, LRU eviction, and one Open WebUI connection for the whole stack.
Published January 10, 2025 · Updated January 20, 2025
llama.cpp Router Mode On Proxmox
Single-model serving is where most homelabs start.
Eventually that stops being enough. You want a small fast model for casual questions, something heavier for reasoning, maybe a code-biased model for development work, and you do not want to SSH into the container every time you switch.
That is the point of Router Mode.
If the base llama.cpp stack is not in place yet, start with llama.cpp Inference On Proxmox. If you want the concrete dual-RTX-3090 deployment example after the architecture is clear, continue to Router Mode Deployment Example.
If the goal is still a browser-first interface but the UI should live in a separate guest, continue to Open WebUI Standalone Frontend On Proxmox.
If the goal is no longer just a browser client but a persistent assistant riding on top of the same multi-model endpoint, continue to OpenClaw On Proxmox.
Why Multi-Model Management
A single llama-server instance locked to one model forces you to stop the server, change the --model flag, and restart every time you want to switch models. This is impractical when:
- You use different models for different tasks — a small fast model for quick questions, a large reasoning model for complex problems, a code-specialised model for development
- You want to A/B test models — compare Qwen3 vs DeepSeek R1 vs Llama 3.3 on the same prompts
- You share the server with multiple users who prefer different models
- You want on-demand availability — select any model from Open WebUI's dropdown without SSH-ing into the container to restart the server
llama.cpp's Router Mode (introduced December 2025) solves all of these. It is a built-in, first-class multi-model manager — no external tools, no wrapper scripts, no Ollama.
Architecture Decision: Single Container With Router Mode
The Three Approaches
| Approach | Description | Pros | Cons |
|---|---|---|---|
| A. Single Container + Router Mode | One llama-server in router mode manages all models | Single GPU config, one systemd service, one Open WebUI connection, built-in LRU eviction, crash isolation per model (multi-process) | All models share one GPU; --models-max limits by count, not VRAM |
B. Multiple llama-server Instances | Separate processes on different ports in the same container | Full control per model, dedicated resources, no eviction surprises | Manual VRAM management, multiple systemd services, multiple Open WebUI connections, no automatic model swapping |
| C. Separate LXC Containers | One container per model | Complete isolation | Massive overhead — duplicate NVIDIA driver (~500 MB), CUDA toolkit (~4 GB), llama.cpp build per container; no VRAM arbitration between containers; complex networking |
Recommendation
Use Approach A (Router Mode) for nearly all homelab scenarios. It was purpose-built for this use case and provides:
- Multi-process architecture — each model runs in its own child process. If one model crashes, the router and other models are unaffected.
- Automatic request routing — the
"model"field in API requests determines which model handles it. Open WebUI sends this natively when users select a model from the dropdown. - LRU eviction — when
--models-maxis reached, the least-recently-used model auto-unloads to make room. - Idle auto-unload —
--sleep-idle-secondsfrees VRAM from models that haven't been used recently. - Single connection — Open WebUI needs only one OpenAI-compatible connection. All models appear in the dropdown automatically.
Disable Ollama To Reclaim VRAM
Since llama.cpp provides superior performance and control over inference, Ollama can be disabled to reclaim the full 48 GB VRAM for llama.cpp.
Stop and Disable Ollama
# Enter the Ollama container (CT 100)
pct enter 100
# Stop the service
systemctl stop ollama
# Disable auto-start on boot
systemctl disable ollama
# Verify it's stopped
systemctl status ollama
# Expected: inactive (dead), disabledVerify VRAM Is Free
# On the Proxmox host
nvidia-smiAlternative: Keep Ollama as a Fallback
# In CT 100 — edit ollama service
systemctl edit ollama.serviceAdd:
[Service]
Environment="OLLAMA_KEEP_ALIVE=0"
Environment="OLLAMA_MAX_LOADED_MODELS=0"Router Mode Overview
Router Mode is llama.cpp's built-in multi-model manager, introduced in PR #17859 (December 2025). It transforms llama-server from a single-model server into a dynamic model router.
How It Works
- Start
llama-serverwithout specifying a model — this activates router mode - The router discovers available models from three sources (cache, directory, or preset file)
- When a request arrives, the router reads the
"model"field and loads/routes to the appropriate model - Each model runs as a separate child process — crash isolation is built in
- When
--models-maxis reached, the least-recently-used model is evicted automatically - Idle models can be auto-unloaded via
--sleep-idle-seconds
Quick Start
# Simplest: serve all models from a directory
llama-server --models-dir /root/models --host 0.0.0.0 --port 8012 --api-key your-secret-key
# With limits and idle timeout
llama-server \
--models-dir /root/models \
--models-max 2 \
--sleep-idle-seconds 900 \
--host 0.0.0.0 \
--port 8012 \
--api-key your-secret-key
# With per-model configuration
llama-server \
--models-preset /root/models.ini \
--models-max 2 \
--sleep-idle-seconds 900 \
--host 0.0.0.0 \
--port 8012 \
--api-key your-secret-keyKey Router Flags
| Flag | Default | Description |
|---|---|---|
--models-dir PATH | disabled | Directory containing GGUF models |
--models-preset PATH | disabled | INI file with per-model configurations |
--models-max N | 4 | Maximum models loaded simultaneously (0 = unlimited) |
--models-autoload | enabled | Auto-load models on first request |
--no-models-autoload | — | Require explicit POST /models/load to load models |
--sleep-idle-seconds N | -1 (disabled) | Auto-unload models after N seconds of inactivity |
--fit [on|off] | on | Auto-adjust parameters to fit device memory |
--fit-target MiB | 1024 | VRAM headroom to leave free per device |
Setting Up the Models Directory
Create the Directory Structure
# Inside the llama.cpp LXC container
mkdir -p /root/modelsDirectory Layout
/root/models/
│
│ # Single-file models (most common)
├── Qwen3-8B-Q4_K_M.gguf
├── Qwen3-32B-Q4_K_M.gguf
├── DeepSeek-R1-32B-Q4_K_M.gguf
├── DeepSeek-R1-8B-Q4_K_M.gguf
├── Llama-3.3-8B-Q4_K_M.gguf
│
│ # Multimodal models (subdirectory — mmproj file must start with "mmproj")
├── gemma-3-4b-it-Q8_0/
│ ├── gemma-3-4b-it-Q8_0.gguf
│ └── mmproj-F16.gguf
│
│ # Multi-shard models (subdirectory — for very large models split into parts)
└── Kimi-K2-Thinking-UD-IQ1_S/
├── Kimi-K2-Thinking-UD-IQ1_S-00001-of-00006.gguf
├── Kimi-K2-Thinking-UD-IQ1_S-00002-of-00006.gguf
└── ...Move Existing Models
# Option A: Move files
mv /root/hf/models/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf /root/models/
mv /root/hf/models/Qwen3-32B-GGUF/Qwen3-32B-Q4_K_M.gguf /root/models/
mv /root/hf/models/DeepSeek-R1-32B-GGUF/DeepSeek-R1-32B-Q4_K_M.gguf /root/models/
# Option B: Symlink (saves disk space, keeps originals)
ln -s /root/hf/models/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf /root/models/
ln -s /root/hf/models/Qwen3-32B-GGUF/Qwen3-32B-Q4_K_M.gguf /root/models/Download Models Directly via Cache
# Download and cache models (one-time — server exits after caching)
llama-server -hf unsloth/Qwen3-8B-GGUF:Q4_K_M &
sleep 30 && kill %1
llama-server -hf unsloth/Qwen3-32B-GGUF:Q4_K_M &
sleep 30 && kill %1
llama-server -hf unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_M &
sleep 30 && kill %1Model Presets — Per-Model Configuration
When using --models-dir, all models inherit the same global settings. For fine-grained control, use a preset INI file.
Create /root/models.ini:
version = 1
; ─── Global defaults applied to ALL models ───────────────────────
[*]
flash-attn = on
jinja = true
fit = on
n-gpu-layers = auto
ctx-size = 131072
cache-type-k = q8_0
cache-type-v = q8_0
threads = 12
; ─── Qwen3 8B — small fast model, quick questions, simple tasks ──
[qwen3-8b]
model = /root/hf/models/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf
load-on-startup = true
stop-timeout = 10
; ─── GLM-4 9B — general purpose, instruction-tuned ──────────────
[glm-4-9b]
model = /root/hf/models/GLM-4-9B-0414-GGUF/GLM-4-9B-0414-Q4_K_M.gguf
; ─── DeepSeek R1 8B — reasoning, chain-of-thought ───────────────
[deepseek-r1-8b]
model = /root/hf/models/DeepSeek-R1-0528-Qwen3-8B/DeepSeek-R1-0528-Qwen3-8B-BF16.gguf
; ─── GPT-OSS 20B — large general model ──────────────────────────
[gpt-oss-20b]
model = /root/hf/models/gpt-oss-20b/gpt-oss-20b-F16.ggufLaunch with Presets
llama-server \
--models-preset /root/models.ini \
--models-max 2 \
--sleep-idle-seconds 900 \
--host 0.0.0.0 \
--port 8012 \
--api-key your-secret-keyVRAM Budget Planning
With dual RTX 3090 (48 GB combined), VRAM is shared across both GPUs. Understanding how models consume VRAM determines your --models-max setting and context size choices.
Model Size Reference (Q4_K_M Quantization)
| Model | Parameters | VRAM (Q4_K_M) | Context 4K Extra | Context 8K Extra |
|---|---|---|---|---|
| Qwen3 1.7B | 1.7B | ~1.5 GB | — | +~250 MB |
| Llama 3.2 3B | 3B | ~2 GB | — | +~250 MB |
| Qwen3 8B / Llama 3.3 8B | 8B | ~5 GB | — | +~500 MB |
| DeepSeek R1 14B | 14B | ~9 GB | +~250 MB | +~750 MB |
| Qwen3 32B / DeepSeek R1 32B | 32B | ~19 GB | +~500 MB | +~1.5 GB |
| Llama 3.1 70B | 70B | ~40 GB | +~1 GB | +~3 GB |
Recommended --models-max for 48 GB VRAM (Dual RTX 3090)
| Model Size Class | Typical VRAM Each | Max Concurrent | Recommended --models-max | Notes |
|---|---|---|---|---|
| 1.7B-3B Q4_K_M | 1.5–2 GB | 4–6 | 4 | Room for large context |
| 7B-8B Q4_K_M | 4–5 GB | 3–4 | 3 | Good balance |
| 13B-14B Q4_K_M | 8–9 GB | 2 | 2 | Tight with context overhead |
| 32B Q4_K_M | 18–20 GB | 1 | 1 | Fills most of VRAM |
| 70B+ Q4_K_M | 40 GB+ | 1 | 1 | Fits on 48 GB; minimal CPU offload needed |
| Mixed (8B + 14B) | 5 + 9 = 14 GB | 2 | 2 | Practical for two-model setup |
Unified Memory (OOM Safety Net)
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1Loading and Unloading Strategies
Router Mode supports multiple strategies for managing which models are in VRAM.
Strategy A: Fully Automatic (LRU Eviction)
llama-server \
--models-dir /root/models \
--models-max 2 \
--host 0.0.0.0 --port 8012 --api-key your-secret-keyStrategy B: Idle Auto-Unload
llama-server \
--models-dir /root/models \
--models-max 4 \
--sleep-idle-seconds 900 \
--host 0.0.0.0 --port 8012 --api-key your-secret-keyStrategy C: Manual Control via API
# Load a specific model
curl -X POST http://localhost:8012/models/load \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{"model": "Qwen3-8B-Q4_K_M.gguf"}'
# Unload a model to free VRAM
curl -X POST http://localhost:8012/models/unload \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{"model": "Qwen3-8B-Q4_K_M.gguf"}'Strategy E: Combined (Recommended for Homelab)
llama-server \
--models-preset /root/models.ini \
--models-max 3 \
--sleep-idle-seconds 900 \
--host 0.0.0.0 --port 8012 --api-key your-secret-keysystemd Service for Router Mode
Replace the single-model llama-server.service from llama.cpp Inference On Proxmox with a router-mode service.
Option A: Using --models-dir (Simpler)
cat > /etc/systemd/system/llama-server.service << 'EOF'
[Unit]
Description=llama.cpp Router — Multi-Model Inference Server
After=network.target
[Service]
Type=simple
ExecStart=/root/llama.cpp/build/bin/llama-server \
--models-dir /root/models \
--models-max 2 \
--sleep-idle-seconds 900 \
--host 0.0.0.0 \
--port 8012 \
--api-key your-secret-key \
--flash-attn on \
--jinja \
--fit on \
--ctx-size 8192 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--threads 4
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
[Install]
WantedBy=multi-user.target
EOFOption B: Using --models-preset (Recommended)
cat > /etc/systemd/system/llama-server.service << 'EOF'
[Unit]
Description=llama.cpp Router — Multi-Model Inference Server
After=network.target
[Service]
Type=simple
ExecStart=/root/llama.cpp/build/bin/llama-server \
--models-preset /root/models.ini \
--models-max 3 \
--sleep-idle-seconds 900 \
--host 0.0.0.0 \
--port 8012 \
--api-key your-secret-key \
--threads 4
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
[Install]
WantedBy=multi-user.target
EOFEnable and Start
systemctl daemon-reload
systemctl enable llama-server
systemctl start llama-serverOpen WebUI Integration
Router Mode dramatically simplifies Open WebUI integration — you need only one connection, and all models appear automatically.
Add a Single Connection
- Open your Open WebUI instance:
http://<openwebui-IP>:8080 - Click your profile icon → Admin Panel → Settings → Connections
- Under OpenAI API, click + to add a new connection:
| Field | Value |
|---|---|
| URL | http://<llama-cpp-LXC-IP>:8012/v1 |
| API Key | your-secret-key (must match --api-key) |
- Click Save
All Models Appear Automatically
Open WebUI queries GET /v1/models on the llama-server. In router mode, this returns all discovered models — not just the loaded one. Users can select any model from the dropdown, and the router handles loading on demand.
Rename Models (Recommended)
- Go to Admin Panel → Settings → Models
- Click the pencil icon next to a model
- Edit the Name to something readable (e.g.,
Qwen3 8B,DeepSeek R1 32B) - Click Save
Monitoring and Health Checks
Model Status
# List all models with their status (loaded/unloaded/loading/failed)
curl -s http://localhost:8012/v1/models \
-H "Authorization: Bearer your-secret-key" | python3 -m json.toolServer Health
# Health check (does not trigger model reload, does not reset idle timer)
curl http://localhost:8012/health
# Expected: {"status": "ok"}Prometheus Metrics
Enable metrics with the --metrics flag in your systemd service:
# Add to ExecStart
--metricsThen scrape:
curl http://localhost:8012/metricsGPU Monitoring
# Real-time VRAM and GPU usage (run on host or inside LXC)
watch -n 1 nvidia-smi
# Better terminal UI
nvtopQuick Health Script
cat > /root/check-llama.sh << 'SCRIPT'
#!/bin/bash
API_KEY="your-secret-key"
BASE="http://localhost:8012"
echo "=== Server Health ==="
curl -s "$BASE/health" | python3 -m json.tool
echo -e "\n=== Loaded Models ==="
curl -s "$BASE/v1/models" -H "Authorization: Bearer $API_KEY" \
| python3 -c "
import sys, json
data = json.load(sys.stdin)
for m in data.get('data', []):
status = m.get('status', {}).get('value', 'unknown')
print(f\" {m['id']:40s} [{status}]\")
"
echo -e "\n=== GPU Memory ==="
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
SCRIPT
chmod +x /root/check-llama.shAdvanced: Multiple Instances (When Router Isn't Enough)
For specific use cases, you may need dedicated llama-server instances alongside or instead of router mode.
systemd Template Unit
cat > /etc/systemd/system/llama-server@.service << 'EOF'
[Unit]
Description=llama.cpp Instance — %i
After=network.target
[Service]
Type=simple
EnvironmentFile=/etc/llama-server/%i.env
ExecStart=/root/llama.cpp/build/bin/llama-server \
--model ${MODEL_PATH} \
--port ${PORT} \
--host 0.0.0.0 \
--api-key ${API_KEY} \
--n-gpu-layers ${GPU_LAYERS} \
--ctx-size ${CTX_SIZE} \
--threads ${THREADS} \
--flash-attn on \
--jinja \
--fit on
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
Environment="CUDA_VISIBLE_DEVICES=${CUDA_DEVICES}"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
[Install]
WantedBy=multi-user.target
EOFUpdating and Maintenance
# --- From the Proxmox host ---
# 1. Take a snapshot before touching anything
pct snapshot <CTID> pre-llamacpp-update-$(date +%Y%m%d) --description "Before llama.cpp update"
# 2. Enter the container
pct enter <CTID>
# --- Inside the container ---
# 3. Stop the service before replacing the binary
systemctl stop llama-server
# 4. Pull latest commits
cd /root/llama.cpp
git pull
# 5. Full CMake reconfigure — picks up new build flags from the updated source
cmake -S . -B build \
-DGGML_CUDA=ON \
-DLLAMA_CURL=ON \
-DLLAMA_OPENSSL=ON \
-DGGML_CUDA_F16=ON \
-DCMAKE_CUDA_ARCHITECTURES="86"
# 6. Rebuild (all cores)
cmake --build build --config Release -j$(nproc)
# 7. Start the service
systemctl start llama-server
# 8. Verify — check status, logs, API, and GPU
systemctl status llama-server
journalctl -u llama-server -n 50
curl http://localhost:8012/v1/models | python3 -m json.tool
nvidia-smiSupport And Additional Resources
Router Mode Documentation
- llama.cpp Model Management Blog Post: https://huggingface.co/blog/ggml-org/model-management-in-llamacpp
- llama.cpp Server README (Router Mode section): https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
- Router Mode PR: https://github.com/ggml-org/llama.cpp/pull/17859
- VRAM-Aware Eviction Discussion: https://github.com/ggml-org/llama.cpp/discussions/19425
Related Topics
- llama.cpp Inference On Proxmox — the single-model setup this page builds on.
- Router Mode Deployment Example — the concrete dual-RTX-3090 deployment checklist and live config example.
- Open WebUI And Ollama On Proxmox — the simpler bundled stack if Router Mode is unnecessary.
- Open WebUI Standalone Frontend On Proxmox — the split browser layer when Router Mode is already handling inference somewhere else.
- OpenClaw On Proxmox — the assistant layer when Telegram, Discord, and scheduled workflows should sit on top of Router Mode.
- GPU Passthrough On Proxmox — the host-side NVIDIA and LXC groundwork that still applies here. *** End Patch