llama.cpp On Proxmox

High-performance LLM inference with full control over GGUF models, build flags, GPU offloading, and server behavior — deployed in a Proxmox LXC container.

Published May 28, 2025 · Updated June 18, 2025

llama.cpp On Proxmox

This section is for the moment when Ollama stops feeling like enough abstraction.

You still want a clean Proxmox deployment, but you also want tighter control over model format, GPU offloading, context size, build flags, and how the server actually behaves once it is under load.

If you want the easier bundled path, use Open WebUI And Ollama On Proxmox. If you want one container serving multiple models dynamically, continue later to Router Mode.

The host-side GPU work still belongs in GPU Passthrough On Proxmox. This section starts once that foundation is already trustworthy.

What Is llama.cpp

llama.cpp is a high-performance LLM inference engine written in pure C/C++ by Georgi Gerganov. It has zero external dependencies and runs large language models using the GGUF quantized model format.

Key Characteristics

  • Pure C/C++ — no Python runtime, no framework overhead
  • Quantization — supports 1.5-bit through 8-bit integer quantization, drastically reducing model size and VRAM usage while preserving quality
  • Hardware breadth — NVIDIA CUDA, AMD ROCm/HIP, Apple Metal, Vulkan, CPU AVX/AVX2/AVX512, and more
  • OpenAI-compatible APIllama-server exposes /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models endpoints
  • Multi-GPU — automatic tensor splitting across multiple GPUs
  • CPU+GPU hybrid — offload as many layers as VRAM allows, spill the rest to system RAM + CPU

Why Use It

llama.cpp gives you direct, granular control over every aspect of model serving — GPU layer offloading, context window size, thread count, sampling parameters, batch size, and build-time CUDA architecture targeting. It is always the first project to support new GGUF features and model architectures because most other tools (including Ollama) are built on top of it.

llama.cpp vs Ollama

Ollama is a convenience wrapper built on top of llama.cpp. It adds model management (ollama pull), a simple API, and automatic configuration — but abstracts away the low-level controls that matter for performance tuning.

Aspectllama.cppOllama
CoreThe inference engine itself (C/C++)Wrapper around llama.cpp (Go + llama.cpp)
SetupBuild from source, manual configOne-line install, batteries-included
Model managementDownload GGUF files manually from Hugging Faceollama pull model-name from Ollama registry
Model formatAny GGUF from any source (Hugging Face, Unsloth, TheBloke, etc.)Ollama registry models only (or import via Modelfile)
GPU layer control--n-gpu-layers N — precise per-layer offloadingAutomatic (limited override via OLLAMA_NUM_GPU)
Context size--ctx-size N — set any value your VRAM supportsDefaults to 2048; override via API parameter
Thread control--threads N — pin to physical core countAutomatic
Sampling paramsFull control: --temp, --top-p, --top-k, --repeat-penalty, --seed, etc.Subset available via API or Modelfile
Multi-GPUNative tensor splitting across GPUsSupported but less configurable
Build-time tuningTarget specific CUDA arch (CMAKE_CUDA_ARCHITECTURES), enable/disable featuresPre-built binary, no compile-time tuning
Quantization choicesAny GGUF quant: Q2_K, Q3_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ variants, etc.Limited to what the Ollama registry offers
Update cadenceUpstream — always first to support new models/featuresLags behind llama.cpp by days to weeks
API compatibilityOpenAI-compatible (/v1/...)Ollama API + partial OpenAI compatibility
OverheadMinimal — runs the binary directlyGo runtime + process management layer
Best forProduction serving, performance tuning, multi-GPU rigs, advanced usersQuick experimentation, simple single-GPU setups, beginners

Recommendation: Use Ollama when you want a quick "just works" setup to experiment with models. Switch to llama.cpp when you need maximum performance, full control over inference parameters, access to any GGUF model from Hugging Face, or are running multi-GPU hardware where every configuration knob matters.

In This Section

Start with the container and GPU setup, then layer in CUDA, build the binary, and connect Open WebUI. Router Mode builds on that foundation.

Comments

Sign in with GitHub to leave a comment or reaction.