Open WebUI Integration
Connect llama-server to Open WebUI and tune GPU offloading, unified memory, multi-GPU setup, and update discipline for ongoing maintenance.
Published May 28, 2025 · Updated June 18, 2025
Open WebUI Integration
llama-server exposes an OpenAI-compatible API, so Open WebUI can use it as an "OpenAI" connection.
Connect to Open WebUI
Step 1: Add the Connection
- Open your Open WebUI instance in a browser:
http://<openwebui-IP>:8080 - Click your profile icon (bottom-left) → Admin Panel → Settings → Connections
- Under OpenAI API, click + to add a new connection:
| Field | Value |
|---|---|
| URL | http://<llama-cpp-LXC-IP>:8012/v1 |
| API Key | your-secret-key (must match --api-key) |
| Prefix | llamacpp- (helps distinguish from Ollama models) |
- Click Save
Finding the llama-cpp container IP:
# From the Proxmox host pct exec <CTID> -- hostname -I
Step 2: Rename Models (Optional)
- Go to Admin Panel → Settings → Models
- Click the pencil icon next to the model
- Edit the Name to something short (e.g.,
Qwen3 8B Q4) - Click Save
Step 3: Verify
- Go to the chat interface
- Select the llama.cpp model from the model dropdown
- Send a test message — you should get a response
- Monitor GPU usage on the host:
nvtop
Best Practices And Performance Tuning
Snapshot After Setup
# From host — replace 101 with your CT ID
pct snapshot <CTID> working-base --description "llama.cpp built, CUDA working, model loaded"GPU Layer Offloading
# Real-time GPU monitoring (run on host)
watch -n 1 nvidia-smi
# Or use nvtop for a nicer view
nvtopUnified Memory (VRAM Overflow to RAM)
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1Multi-GPU Setup
# Use all GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Use only GPU 0 and 1
export CUDA_VISIBLE_DEVICES=0,1export CUDA_SCALE_LAUNCH_QUEUES=4xUpdating llama.cpp
# 1. Take a snapshot first (from the Proxmox host before entering the container)
# pct snapshot <CTID> pre-llamacpp-update-$(date +%Y%m%d) --description "Before llama.cpp update"
# 2. Stop the service before replacing the binary
systemctl stop llama-server
# 3. Pull latest changes
cd /root/llama.cpp
git pull
# 4. Full CMake reconfigure — picks up any new build flags from the updated source
cmake -S . -B build \
-DGGML_CUDA=ON \
-DLLAMA_CURL=ON \
-DLLAMA_OPENSSL=ON \
-DGGML_CUDA_F16=ON \
-DCMAKE_CUDA_ARCHITECTURES="86"
# 5. Rebuild (all cores)
cmake --build build --config Release -j$(nproc)
# 6. Start the service
systemctl start llama-server
# 7. Verify
systemctl status llama-server
journalctl -u llama-server -n 50
curl http://localhost:8012/v1/models | python3 -m json.tool
nvidia-smiRunning Multiple Models
You can run multiple llama-server instances on different ports, each serving a different model:
# Server 1: Small fast model for simple queries
./llama.cpp/build/bin/llama-server \
--model /root/models/qwen3-1.7b-q4.gguf \
--port 8012 --host 0.0.0.0 --api-key key1 \
--n-gpu-layers 99
# Server 2: Large model for complex reasoning
./llama.cpp/build/bin/llama-server \
--model /root/models/qwen3-32b-q4.gguf \
--port 8013 --host 0.0.0.0 --api-key key2 \
--n-gpu-layers 40Better approach: Use llama.cpp's built-in Router Mode instead of running multiple instances. Router Mode manages multiple models in a single process with automatic LRU eviction, idle unloading, and per-model configuration.
Related Topics
- Open WebUI And Ollama On Proxmox — the simpler bundled stack when you want faster setup and less control.
- Limited VRAM Playbook (Large Models) — memory-focused tuning patterns for running larger models on constrained GPUs.
- Router Mode — the multi-model path once a single-model server starts feeling too limiting.
- GPU Passthrough On Proxmox — the host-side NVIDIA groundwork this container relies on.
- Update And Maintenance — where host-first updates and driver version discipline live.
- Container Network Throttling — useful when large model downloads should not take the whole link while this container is filling
/root/models.