Open WebUI Integration

Connect llama-server to Open WebUI and tune GPU offloading, unified memory, multi-GPU setup, and update discipline for ongoing maintenance.

Published May 28, 2025 · Updated June 18, 2025

Open WebUI Integration

llama-server exposes an OpenAI-compatible API, so Open WebUI can use it as an "OpenAI" connection.

Connect to Open WebUI

Step 1: Add the Connection

  1. Open your Open WebUI instance in a browser: http://<openwebui-IP>:8080
  2. Click your profile icon (bottom-left) → Admin PanelSettingsConnections
  3. Under OpenAI API, click + to add a new connection:
FieldValue
URLhttp://<llama-cpp-LXC-IP>:8012/v1
API Keyyour-secret-key (must match --api-key)
Prefixllamacpp- (helps distinguish from Ollama models)
  1. Click Save

Finding the llama-cpp container IP:

# From the Proxmox host
pct exec <CTID> -- hostname -I

Step 2: Rename Models (Optional)

  1. Go to Admin PanelSettingsModels
  2. Click the pencil icon next to the model
  3. Edit the Name to something short (e.g., Qwen3 8B Q4)
  4. Click Save

Step 3: Verify

  1. Go to the chat interface
  2. Select the llama.cpp model from the model dropdown
  3. Send a test message — you should get a response
  4. Monitor GPU usage on the host: nvtop

Best Practices And Performance Tuning

Snapshot After Setup

# From host — replace 101 with your CT ID
pct snapshot <CTID> working-base --description "llama.cpp built, CUDA working, model loaded"

GPU Layer Offloading

# Real-time GPU monitoring (run on host)
watch -n 1 nvidia-smi
 
# Or use nvtop for a nicer view
nvtop

Unified Memory (VRAM Overflow to RAM)

export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

Multi-GPU Setup

# Use all GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3
 
# Use only GPU 0 and 1
export CUDA_VISIBLE_DEVICES=0,1
export CUDA_SCALE_LAUNCH_QUEUES=4x

Updating llama.cpp

# 1. Take a snapshot first (from the Proxmox host before entering the container)
# pct snapshot <CTID> pre-llamacpp-update-$(date +%Y%m%d) --description "Before llama.cpp update"
 
# 2. Stop the service before replacing the binary
systemctl stop llama-server
 
# 3. Pull latest changes
cd /root/llama.cpp
git pull
 
# 4. Full CMake reconfigure — picks up any new build flags from the updated source
cmake -S . -B build \
  -DGGML_CUDA=ON \
  -DLLAMA_CURL=ON \
  -DLLAMA_OPENSSL=ON \
  -DGGML_CUDA_F16=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86"
 
# 5. Rebuild (all cores)
cmake --build build --config Release -j$(nproc)
 
# 6. Start the service
systemctl start llama-server
 
# 7. Verify
systemctl status llama-server
journalctl -u llama-server -n 50
curl http://localhost:8012/v1/models | python3 -m json.tool
nvidia-smi

Running Multiple Models

You can run multiple llama-server instances on different ports, each serving a different model:

# Server 1: Small fast model for simple queries
./llama.cpp/build/bin/llama-server \
  --model /root/models/qwen3-1.7b-q4.gguf \
  --port 8012 --host 0.0.0.0 --api-key key1 \
  --n-gpu-layers 99
 
# Server 2: Large model for complex reasoning
./llama.cpp/build/bin/llama-server \
  --model /root/models/qwen3-32b-q4.gguf \
  --port 8013 --host 0.0.0.0 --api-key key2 \
  --n-gpu-layers 40

Better approach: Use llama.cpp's built-in Router Mode instead of running multiple instances. Router Mode manages multiple models in a single process with automatic LRU eviction, idle unloading, and per-model configuration.

Comments

Sign in with GitHub to leave a comment or reaction.