llama.cpp On Proxmox
High-performance LLM inference with full control over GGUF models, build flags, GPU offloading, and server behavior — deployed in a Proxmox LXC container.
Published May 28, 2025 · Updated June 18, 2025
llama.cpp On Proxmox
This section is for the moment when Ollama stops feeling like enough abstraction.
You still want a clean Proxmox deployment, but you also want tighter control over model format, GPU offloading, context size, build flags, and how the server actually behaves once it is under load.
If you want the easier bundled path, use Open WebUI And Ollama On Proxmox. If you want one container serving multiple models dynamically, continue later to Router Mode.
The host-side GPU work still belongs in GPU Passthrough On Proxmox. This section starts once that foundation is already trustworthy.
What Is llama.cpp
llama.cpp is a high-performance LLM inference engine written in pure C/C++ by Georgi Gerganov. It has zero external dependencies and runs large language models using the GGUF quantized model format.
Key Characteristics
- Pure C/C++ — no Python runtime, no framework overhead
- Quantization — supports 1.5-bit through 8-bit integer quantization, drastically reducing model size and VRAM usage while preserving quality
- Hardware breadth — NVIDIA CUDA, AMD ROCm/HIP, Apple Metal, Vulkan, CPU AVX/AVX2/AVX512, and more
- OpenAI-compatible API —
llama-serverexposes/v1/chat/completions,/v1/completions,/v1/embeddings, and/v1/modelsendpoints - Multi-GPU — automatic tensor splitting across multiple GPUs
- CPU+GPU hybrid — offload as many layers as VRAM allows, spill the rest to system RAM + CPU
Why Use It
llama.cpp gives you direct, granular control over every aspect of model serving — GPU layer offloading, context window size, thread count, sampling parameters, batch size, and build-time CUDA architecture targeting. It is always the first project to support new GGUF features and model architectures because most other tools (including Ollama) are built on top of it.
llama.cpp vs Ollama
Ollama is a convenience wrapper built on top of llama.cpp. It adds model management (ollama pull), a simple API, and automatic configuration — but abstracts away the low-level controls that matter for performance tuning.
| Aspect | llama.cpp | Ollama |
|---|---|---|
| Core | The inference engine itself (C/C++) | Wrapper around llama.cpp (Go + llama.cpp) |
| Setup | Build from source, manual config | One-line install, batteries-included |
| Model management | Download GGUF files manually from Hugging Face | ollama pull model-name from Ollama registry |
| Model format | Any GGUF from any source (Hugging Face, Unsloth, TheBloke, etc.) | Ollama registry models only (or import via Modelfile) |
| GPU layer control | --n-gpu-layers N — precise per-layer offloading | Automatic (limited override via OLLAMA_NUM_GPU) |
| Context size | --ctx-size N — set any value your VRAM supports | Defaults to 2048; override via API parameter |
| Thread control | --threads N — pin to physical core count | Automatic |
| Sampling params | Full control: --temp, --top-p, --top-k, --repeat-penalty, --seed, etc. | Subset available via API or Modelfile |
| Multi-GPU | Native tensor splitting across GPUs | Supported but less configurable |
| Build-time tuning | Target specific CUDA arch (CMAKE_CUDA_ARCHITECTURES), enable/disable features | Pre-built binary, no compile-time tuning |
| Quantization choices | Any GGUF quant: Q2_K, Q3_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ variants, etc. | Limited to what the Ollama registry offers |
| Update cadence | Upstream — always first to support new models/features | Lags behind llama.cpp by days to weeks |
| API compatibility | OpenAI-compatible (/v1/...) | Ollama API + partial OpenAI compatibility |
| Overhead | Minimal — runs the binary directly | Go runtime + process management layer |
| Best for | Production serving, performance tuning, multi-GPU rigs, advanced users | Quick experimentation, simple single-GPU setups, beginners |
Recommendation: Use Ollama when you want a quick "just works" setup to experiment with models. Switch to llama.cpp when you need maximum performance, full control over inference parameters, access to any GGUF model from Hugging Face, or are running multi-GPU hardware where every configuration knob matters.
In This Section
Start with the container and GPU setup, then layer in CUDA, build the binary, and connect Open WebUI. Router Mode builds on that foundation.
- Container Setup — create the LXC container and configure GPU passthrough.
- CUDA And Driver Install — install CUDA Toolkit 12.8 and the NVIDIA userspace driver inside the container.
- Build And Serve — build llama.cpp from source, download GGUF models, and run the server.
- Open WebUI Integration — connect llama-server to Open WebUI and tune performance.
- Limited VRAM Playbook (Large Models) — run large models on small or older GPUs with practical memory and stability tuning.
- llama-server Parameters Explained — research-backed guide to the most-used flags and how to tune them safely.
- Router Mode — multi-model serving with on-demand loading, LRU eviction, and a single Open WebUI connection.
- Router Mode Deployment Example — concrete dual-RTX-3090 deployment walkthrough.