Router Mode Deployment Example
A concrete Router Mode deployment example for llama.cpp on Proxmox, using Qwen3, GLM-4, DeepSeek-R1, and gpt-oss on RTX 3090 hardware.
Published January 14, 2025
This page is the concrete version of the Router Mode story.
The main guide explains the architecture and the tradeoffs. This page keeps one tested Router Mode deployment checklist intact so you can follow a concrete dual-RTX-3090 path end to end.
Read llama.cpp Router Mode On Proxmox first if you still need the reasoning behind the layout.
Date: January 14, 2025
Hardware: NVIDIA RTX 3090 (24 GB VRAM)
Models: Qwen3-8B, GLM-4-9B, DeepSeek-R1-Qwen3-8B, gpt-oss-20b
Target: Router Mode with LRU eviction on 2 concurrent models
Pre-Deployment Checklist
- SSH access to llama-cpp LXC container (CT 105 or similar)
- llama.cpp built from source after December 11, 2025 (Router Mode support)
- 4 GGUF model files already in /root/hf/models/ (verify via ls -lhR)
- Open WebUI running and connected to port 8012 (or will be reconfigured)
- Backup current /etc/systemd/system/llama-server.service if it exists
- CUDA_VISIBLE_DEVICES=0 verified on host GPU device
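The file-count item in the checklist can be scripted. The following is a minimal sketch rather than part of the tested deployment: `count_gguf` is a hypothetical helper name, and the expected count of 4 simply mirrors the model list above.

```shell
# Hypothetical helper: count .gguf files anywhere under a directory tree.
count_gguf() {
  local dir="$1"
  find "$dir" -type f -name '*.gguf' | wc -l | tr -d ' '
}

# Example (path taken from the checklist above):
#   n=$(count_gguf /root/hf/models)
#   [ "$n" -ge 4 ] || echo "expected at least 4 GGUF files, found $n" >&2
```

A count only shows that the files exist; it says nothing about whether each download completed.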
Step 1: Verify llama.cpp Version And Router Mode Support
On the llama-cpp container, check the binary version:
# Login to container if not already done
pct enter 105 # or appropriate CT ID
# Check version (build date should be >= Dec 11, 2025)
/root/llama.cpp/build/bin/llama-server --version
If older, rebuild llama.cpp from source:
cd /root/llama.cpp
git pull
mkdir -p build
cd build
cmake .. -DGGML_CUDA=ON -DGGML_CUDA_F16=ON
make -j$(nproc)
Step 2: Verify Model Files And Integrity
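A directory listing shows file sizes but not whether a file is really a GGUF model. One cheap extra check, sketched here under the assumption that a failed download leaves an empty file or an HTML error page, is to compare the first four bytes against the `GGUF` magic that every GGUF file starts with. `is_gguf` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the file starts with the 4-byte "GGUF" magic.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null | tr -d '\0')" = "GGUF" ]
}

# Example: report every .gguf file under the model directory that fails the check.
#   find /root/hf/models -type f -name '*.gguf' | while read -r f; do
#     is_gguf "$f" || echo "not a GGUF file: $f" >&2
#   done
```

This does not detect a download that was truncated partway through; comparing checksums against the model repository is the stronger test.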
ls -lhR /root/hf/models/
Discard empty directories:
rmdir /root/hf/models/Llama-3.1-8B-Instruct-GGUF
rmdir /root/hf/models/Qwen3-32B-GGUF
Step 3: Create Configuration Directory
mkdir -p /root/configs
cd /root/configs
Step 4: Create models.ini (Preset Configuration)
Copy the following into /root/configs/models.ini:
version = 1
; ─── Global defaults applied to ALL models ───────────────────────
[*]
flash-attn = on
jinja = true
fit = on
cache-type-k = q8_0
cache-type-v = q8_0
threads = 4
; ─── Qwen3-8B (Fast, General Purpose) ─────────────────────────────
; Primary model: always loaded on startup
; VRAM: 5–7 GB | Context: 32K–131K (with YaRN rope scaling 4x)
[qwen3-8b]
model = /root/hf/models/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf
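ctx-size = 32768
; 32K is the model's native context, so no RoPE scaling is applied at this ctx-size.
; To extend toward 131K, raise ctx-size and enable YaRN (not linear) scaling:
; rope-scaling = yarn
; rope-scale = 4.0
; yarn-orig-ctx = 32768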
load-on-startup = true
stop-timeout = 10
; ─── GLM-4-9B (Balanced, High-Quality) ────────────────────────────
; Secondary model: also loaded on startup (see load-on-startup below)
; VRAM: 6–8 GB | Context: 8K–128K
[glm-4-9b]
model = /root/hf/models/GLM-4-9B-0414-GGUF/GLM-4-9B-0414-Q4_K_M.gguf
ctx-size = 128000
load-on-startup = true
stop-timeout = 10
; ─── DeepSeek-R1-Qwen3-8B (Deep Reasoning) ────────────────────────
; High-capability reasoning model; loads on-demand (evicts smaller models)
; VRAM: 18–20 GB | Context: 32K–64K
; CRITICAL: temperature=0.6 is required (enforce in Open WebUI)
[deepseek-r1-qwen3-8b]
model = /root/hf/models/DeepSeek-R1-0528-Qwen3-8B/DeepSeek-R1-0528-Qwen3-8B-BF16.gguf
ctx-size = 64000
load-on-startup = false
stop-timeout = 20
; ─── GPT-OSS-20B (Production Reasoning, MoE) ──────────────────────
; Advanced reasoning with function calling; loads on-demand (exclusive VRAM)
; VRAM: 16 GB | Context: 4K–8K (optimized for MoE sparse activation)
; CRITICAL: temperature=1.0 and top_p=1.0 are non-negotiable
; CRITICAL: must use Harmony format prompting
[gpt-oss-20b]
model = /root/hf/models/gpt-oss-20b/gpt-oss-20b-F16.gguf
ctx-size = 8192
; Harmony formatting comes from the model's embedded chat template (needs jinja = true, set globally above)
load-on-startup = false
stop-timeout = 30
Create the file:
cat > /root/configs/models.ini << 'INIEOF'
[PASTE CONTENT FROM ABOVE]
INIEOF
Verify:
cat /root/configs/models.ini
Step 5: Update systemd Service
Back up the current service file (if it exists):
cp /etc/systemd/system/llama-server.service /etc/systemd/system/llama-server.service.backup
Create the new service file:
cat > /etc/systemd/system/llama-server.service << 'SERVICEEOF'
[Unit]
Description=llama.cpp Inference Server (Router Mode Multi-Model)
After=network.target
Documentation=https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
[Service]
Type=simple
User=root
WorkingDirectory=/root
ExecStart=/root/llama.cpp/build/bin/llama-server \
--models-preset /root/configs/models.ini \
--models-max 2 \
--sleep-idle-seconds 600 \
--port 8012 \
--host 0.0.0.0 \
--api-key your-secret-key \
--metrics
Environment="CUDA_VISIBLE_DEVICES=0"
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=llama-server
KillMode=process
KillSignal=SIGTERM
TimeoutStopSec=60
[Install]
WantedBy=multi-user.target
SERVICEEOF
Step 6: Enable And Start The Service
# Reload systemd daemon
systemctl daemon-reload
# Enable auto-start on boot
systemctl enable llama-server
# Start the service
systemctl start llama-server
# Check status (should show "active (running)")
systemctl status llama-server
Step 7: Verify Service Is Running
Check service status:
systemctl status llama-server
View real-time logs:
journalctl -u llama-server -f
Expected log output:
llama-server: Starting router mode with --models-preset /root/configs/models.ini
llama-server: Router mode enabled, max 2 models
llama-server: Loading preset file: /root/configs/models.ini
llama-server: Discovered model [qwen3-8b] at /root/hf/models/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf
llama-server: Discovered model [glm-4-9b] at /root/hf/models/GLM-4-9B-0414-GGUF/GLM-4-9B-0414-Q4_K_M.gguf
llama-server: Discovered model [deepseek-r1-qwen3-8b] at /root/hf/models/DeepSeek-R1-0528-Qwen3-8B/DeepSeek-R1-0528-Qwen3-8B-BF16.gguf
llama-server: Discovered model [gpt-oss-20b] at /root/hf/models/gpt-oss-20b/gpt-oss-20b-F16.gguf
llama-server: Server listening on port 8012
Step 8: Test API Endpoint
Check available models:
curl -s http://localhost:8012/v1/models | python3 -m json.tool
Expected status after startup:
- qwen3-8b: "loaded"
- glm-4-9b: "loaded"
- deepseek-r1-qwen3-8b: "unloaded"
- gpt-oss-20b: "unloaded"
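The startup models take a few seconds to load, so a scripted check may want to wait before querying the model list. The sketch below is a generic retry loop, not something llama.cpp provides: `wait_until` is a hypothetical helper, and the example assumes the server answers on its `/health` endpoint once it is up.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example: wait up to about 60 seconds, then list the models.
#   wait_until 30 2 curl -sf -o /dev/null http://localhost:8012/health \
#     && curl -s http://localhost:8012/v1/models | python3 -m json.tool
```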
Step 9: Test Model Inference
Test Qwen3-8B (should respond immediately, already loaded):
curl -s -X POST http://localhost:8012/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{
"model": "qwen3-8b",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"max_tokens": 100,
"temperature": 0.7
}' | python3 -m json.tool
Test DeepSeek-R1 (loads on the first request, which takes ~15-30 sec):
curl -s -X POST http://localhost:8012/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{
"model": "deepseek-r1-qwen3-8b",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}],
"max_tokens": 500,
"temperature": 0.6
}' | python3 -m json.tool
Step 10: Update Open WebUI Connection
- Open Open WebUI: http://<openwebui-IP>:8080
- Go to Profile → Admin Panel → Settings → Connections
- Find the old Ollama/llama-server connection and delete it
- Click + to add new connection:
  - URL: http://<llama-cpp-container-IP>:8012/v1
  - API Key: your-secret-key (must match service)
- Click Save
- Go to Models → you should see all 4 models in the dropdown:
  - qwen3-8b
  - glm-4-9b
  - deepseek-r1-qwen3-8b
  - gpt-oss-20b
Step 11: Configure Model-Specific Sampling In Open WebUI
Qwen3-8B: Flexible
- Temperature: 0.6–0.8 (reasoning) or 0.7 (default)
- Top P: 0.8–0.95
- Top K: 20–50
GLM-4-9B: Flexible
- Temperature: 0.6–0.9
- Top P: 0.85–0.95
- Top K: 30–100
DeepSeek-R1: STRICT
- Temperature: 0.6 (FIXED — do NOT change)
- Top P: 0.95 (recommended)
- Top K: 40 (recommended)
gpt-oss-20b: STRICT
- Temperature: 1.0 (FIXED — non-negotiable)
- Top P: 1.0 (FIXED — non-negotiable)
- Top K: null (no top-k filtering)
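The same settings can be applied when calling the API directly instead of through Open WebUI. The sketch below maps each model alias to a fragment of JSON sampling fields. `sampling_for` is a hypothetical helper. The values for the two strict models come from the lists above, while the values for Qwen3-8B and GLM-4-9B are arbitrary picks from within the stated ranges. It also assumes the server accepts `top_k` alongside the standard OpenAI fields.

```shell
# Hypothetical helper: print JSON sampling fields for a model alias from models.ini.
sampling_for() {
  case "$1" in
    deepseek-r1-qwen3-8b) echo '"temperature": 0.6, "top_p": 0.95, "top_k": 40' ;;
    gpt-oss-20b)          echo '"temperature": 1.0, "top_p": 1.0' ;;
    # Assumed picks from the flexible ranges above; adjust to taste.
    qwen3-8b)             echo '"temperature": 0.7, "top_p": 0.8, "top_k": 20' ;;
    glm-4-9b)             echo '"temperature": 0.7, "top_p": 0.9, "top_k": 40' ;;
    *) echo "unknown model: $1" >&2; return 1 ;;
  esac
}

# Example: build a request body with the model's settings.
#   m=gpt-oss-20b
#   body="{\"model\": \"$m\", $(sampling_for "$m"), \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}"
#   curl -s http://localhost:8012/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer your-secret-key" \
#     -d "$body" | python3 -m json.tool
```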
Step 12: Monitor VRAM And Model Loading
# On the Proxmox HOST (not container)
watch -n 1 nvidia-smi
# Or in the llama-cpp CONTAINER
nvidia-smi -i 0 -l 1
Watch the service logs for load/unload events:
journalctl -u llama-server -f | grep -E "(loading|unloading|evicting|sleep)"
Troubleshooting
Problem: Service fails to start
journalctl -u llama-server -n 50 --no-pager
Problem: VRAM not freeing when switching models
curl -X POST http://localhost:8012/models/unload \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{"model": "qwen3-8b"}'
Performance Expectations
| Model | First Load | Cold Reload | Inference Speed | Best For |
|---|---|---|---|---|
| Qwen3-8B | 3–5 sec | 2–3 sec | ~15–20 tok/sec | Fast chat, quick Q&A |
| GLM-4-9B | 4–6 sec | 2–4 sec | ~12–18 tok/sec | Balanced, instruction-following |
| DeepSeek-R1 | 20–30 sec | 10–15 sec | ~5–10 tok/sec | Complex reasoning, research |
| gpt-oss-20b | 20–35 sec | 15–20 sec | ~8–15 tok/sec | Production reasoning, tools |
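To compare a real run against the table, divide the generated token count by the elapsed time. The helper below is a small sketch of that arithmetic: `tok_per_sec` is a hypothetical name, and the example assumes the response carries the standard `usage.completion_tokens` field.

```shell
# Hypothetical helper: tokens per second from a token count and elapsed seconds.
tok_per_sec() {
  awk -v t="$1" -v s="$2" 'BEGIN {
    if (s <= 0) { print "n/a"; exit }
    printf "%.1f\n", t / s
  }'
}

# Example: time one request with wall-clock seconds.
#   start=$(date +%s)
#   resp=$(curl -s http://localhost:8012/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer your-secret-key" \
#     -d '{"model": "qwen3-8b", "messages": [{"role": "user", "content": "Count to 50"}], "max_tokens": 200}')
#   end=$(date +%s)
#   tokens=$(printf '%s' "$resp" | python3 -c 'import json,sys; print(json.load(sys.stdin)["usage"]["completion_tokens"])')
#   tok_per_sec "$tokens" "$((end - start))"
```

Wall-clock timing includes prompt processing and any model load, so it understates pure generation speed; run the request twice and use the second result.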
Next Steps
- Monitor for 24+ hours; check VRAM allocation patterns
- Adjust context sizes per-model if needed (conservative defaults already set)
- Fine-tune --sleep-idle-seconds based on usage patterns
- Document your workflow (which model for which task) in Open WebUI model descriptions
- Consider quantized versions of large models if VRAM becomes a bottleneck (DeepSeek-R1 Q4_K_M = ~5 GB)
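Whether VRAM is a bottleneck comes down to simple addition: with two models resident, their footprints must fit in 24 GB together. The sketch below runs that sum using the upper-bound estimates from the models.ini comments. `fits_in_vram` is a hypothetical helper, and the figures are estimates rather than measurements, so real usage will vary with context size.

```shell
# Hypothetical helper: does the sum of per-model estimates (whole GB) fit the budget?
# Usage: fits_in_vram <budget-gb> <model-gb>...
fits_in_vram() {
  local budget="$1" total=0 gb
  shift
  for gb in "$@"; do
    total=$((total + gb))
  done
  echo "total ${total} GB of ${budget} GB"
  [ "$total" -le "$budget" ]
}

# Examples using the estimates from models.ini:
#   fits_in_vram 24 7 8    # Qwen3-8B + GLM-4-9B: total 15 GB, fits
#   fits_in_vram 24 7 20   # Qwen3-8B + DeepSeek-R1 BF16: total 27 GB, does not fit
#   fits_in_vram 24 7 5    # Qwen3-8B + DeepSeek-R1 Q4_K_M: total 12 GB, fits
```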
Related Topics
- llama.cpp Router Mode On Proxmox: the canonical architecture and operations guide this example sits under.
- llama.cpp Inference On Proxmox: the single-model setup you should already have working before using this example.
- GPU Passthrough On Proxmox: the host-side NVIDIA groundwork that still applies here.