llama.cpp On Proxmox

This section is for the moment when Ollama stops feeling like enough abstraction.

You still want a clean Proxmox deployment, but you also want tighter control over model format, GPU offloading, context size, build flags, and how the server actually behaves once it is under load.

If you want the easier bundled path, use Open WebUI And Ollama On Proxmox. If you want one container serving multiple models dynamically, continue later to Router Mode.

The host-side GPU work still belongs in GPU Passthrough On Proxmox. This section starts once that foundation is already trustworthy.

What Is llama.cpp

llama.cpp is a high-performance LLM inference engine written in pure C/C++ by Georgi Gerganov. It has zero external dependencies and runs large language models using the GGUF quantized model format.

Key Characteristics

Pure C/C++ — no Python runtime, no framework overhead
Quantization — supports 1.5-bit through 8-bit integer quantization, drastically reducing model size and VRAM usage while preserving quality
Hardware breadth — NVIDIA CUDA, AMD ROCm/HIP, Apple Metal, Vulkan, CPU AVX/AVX2/AVX512, and more
OpenAI-compatible API — llama-server exposes /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models endpoints
Multi-GPU — automatic tensor splitting across multiple GPUs
CPU+GPU hybrid — offload as many layers as VRAM allows, spill the rest to system RAM + CPU

Why Use It

llama.cpp gives you direct, granular control over every aspect of model serving — GPU layer offloading, context window size, thread count, sampling parameters, batch size, and build-time CUDA architecture targeting. It is always the first project to support new GGUF features and model architectures because most other tools (including Ollama) are built on top of it.

llama.cpp vs Ollama

Ollama is a convenience wrapper built on top of llama.cpp. It adds model management (ollama pull), a simple API, and automatic configuration — but abstracts away the low-level controls that matter for performance tuning.

Aspect	llama.cpp	Ollama
Core	The inference engine itself (C/C++)	Wrapper around llama.cpp (Go + llama.cpp)
Setup	Build from source, manual config	One-line install, batteries-included
Model management	Download GGUF files manually from Hugging Face	`ollama pull model-name` from Ollama registry
Model format	Any GGUF from any source (Hugging Face, Unsloth, TheBloke, etc.)	Ollama registry models only (or import via `Modelfile`)
GPU layer control	`--n-gpu-layers N` — precise per-layer offloading	Automatic (limited override via `OLLAMA_NUM_GPU`)
Context size	`--ctx-size N` — set any value your VRAM supports	Defaults to 2048; override via API parameter
Thread control	`--threads N` — pin to physical core count	Automatic
Sampling params	Full control: `--temp`, `--top-p`, `--top-k`, `--repeat-penalty`, `--seed`, etc.	Subset available via API or `Modelfile`
Multi-GPU	Native tensor splitting across GPUs	Supported but less configurable
Build-time tuning	Target specific CUDA arch (`CMAKE_CUDA_ARCHITECTURES`), enable/disable features	Pre-built binary, no compile-time tuning
Quantization choices	Any GGUF quant: Q2_K, Q3_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ variants, etc.	Limited to what the Ollama registry offers
Update cadence	Upstream — always first to support new models/features	Lags behind llama.cpp by days to weeks
API compatibility	OpenAI-compatible (`/v1/...`)	Ollama API + partial OpenAI compatibility
Overhead	Minimal — runs the binary directly	Go runtime + process management layer
Best for	Production serving, performance tuning, multi-GPU rigs, advanced users	Quick experimentation, simple single-GPU setups, beginners

Recommendation: Use Ollama when you want a quick "just works" setup to experiment with models. Switch to llama.cpp when you need maximum performance, full control over inference parameters, access to any GGUF model from Hugging Face, or are running multi-GPU hardware where every configuration knob matters.

In This Section

Start with the container and GPU setup, then layer in CUDA, build the binary, and connect Open WebUI. Router Mode builds on that foundation.

Container Setup — create the LXC container and configure GPU passthrough.
CUDA And Driver Install — install CUDA Toolkit 12.8 and the NVIDIA userspace driver inside the container.
Build And Serve — build llama.cpp from source, download GGUF models, and run the server.
Open WebUI Integration — connect llama-server to Open WebUI and tune performance.
Limited VRAM Playbook (Large Models) — run large models on small or older GPUs with practical memory and stability tuning.
llama-server Parameters Explained — research-backed guide to the most-used flags and how to tune them safely.
Router Mode — multi-model serving with on-demand loading, LRU eviction, and a single Open WebUI connection.
Router Mode Deployment Example — concrete dual-RTX-3090 deployment walkthrough.