Build And Serve

Build llama.cpp from source with CUDA support, download GGUF models from Hugging Face, and run llama-server as a systemd service.

Published May 28, 2025 · Updated June 18, 2025

Build llama.cpp from Source

Building from source is the recommended approach — it produces binaries optimised for your specific GPU and system.

# Clone the repository into /root so the paths used below match
git clone https://github.com/ggml-org/llama.cpp /root/llama.cpp
 
# Build from inside the repo
cd /root/llama.cpp
 
# Enable HTTPS downloads for -hf (requires the libcurl and OpenSSL dev files)
apt update
apt install -y libcurl4-openssl-dev libssl-dev
 
# Configure with CUDA and curl support (curl enables -hf direct downloads)
cmake -S . -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DLLAMA_OPENSSL=ON
 
# Build (use all available cores)
cmake --build build --config Release -j$(nproc)

Optional — target your specific GPU compute capability for faster compilation and slightly better performance:

# RTX 3090 = compute 8.6, RTX 4090 = 8.9, RTX 5060 Ti = 12.0
cmake -S . -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DLLAMA_OPENSSL=ON -DCMAKE_CUDA_ARCHITECTURES="86"
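
If you are unsure which value to pass, a small helper can derive it from the GPUs in the machine. The sketch below is illustrative and not part of llama.cpp: it assumes a reasonably recent NVIDIA driver whose nvidia-smi supports the `compute_cap` query field, and the function names `cap_to_arch` and `detect_archs` are made up for this example.

```python
import subprocess


def cap_to_arch(compute_cap: str) -> str:
    """Convert a compute capability such as '8.6' into the CMake form '86'."""
    major, minor = compute_cap.strip().split(".")
    return f"{int(major)}{int(minor)}"


def detect_archs() -> str:
    """Query nvidia-smi for every GPU and join the unique architectures with ';'."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    archs = {cap_to_arch(line) for line in out.splitlines() if line.strip()}
    return ";".join(sorted(archs, key=int))


if __name__ == "__main__":
    try:
        print(f'-DCMAKE_CUDA_ARCHITECTURES="{detect_archs()}"')
    except (OSError, subprocess.CalledProcessError, ValueError) as err:
        # nvidia-smi is missing, failed, or printed something unexpected
        print(f"could not detect compute capability: {err}")
```

On a machine with mixed GPUs this prints a semicolon-separated list, which CMake accepts for `CMAKE_CUDA_ARCHITECTURES`.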

Binaries Produced

| Binary | Location | Purpose |
| --- | --- | --- |
| llama-server | llama.cpp/build/bin/llama-server | OpenAI-compatible HTTP API server |
| llama-cli | llama.cpp/build/bin/llama-cli | Interactive CLI chat / text completion |
| llama-bench | llama.cpp/build/bin/llama-bench | Benchmark inference speed (tokens/sec) |
| llama-gguf-split | llama.cpp/build/bin/llama-gguf-split | Split/merge large GGUF files |
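
A quick way to confirm the build finished is to check that each binary exists and is executable. This is a minimal sketch; the helper name `missing_binaries` is hypothetical, and the directory assumes the repository was built under /root/llama.cpp as in the commands above.

```python
import os
from pathlib import Path

# Binaries listed in the table above
EXPECTED = ["llama-server", "llama-cli", "llama-bench", "llama-gguf-split"]


def missing_binaries(bin_dir: str, names=EXPECTED) -> list[str]:
    """Return the names that are absent or not executable inside bin_dir."""
    base = Path(bin_dir)
    return [
        name
        for name in names
        if not (base / name).is_file() or not os.access(base / name, os.X_OK)
    ]


if __name__ == "__main__":
    missing = missing_binaries("/root/llama.cpp/build/bin")
    print("all binaries present" if not missing else f"missing: {', '.join(missing)}")
```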

Download Models from Hugging Face

Models must be in GGUF format. The examples on this page pull ready-made GGUF files from the unsloth repositories on Hugging Face.

Quantization Guide

| Quantization | Bits | Quality | VRAM Usage | When to Use |
| --- | --- | --- | --- | --- |
| Q2_K | 2-bit | Low | Smallest | Very limited VRAM, willing to sacrifice quality |
| Q3_K_M | 3-bit | Fair | Low | Entry-level GPUs (8 GB VRAM) |
| Q4_K_M | 4-bit | Good | Moderate | Best balance; recommended starting point |
| Q5_K_M | 5-bit | Very Good | Higher | 24 GB+ VRAM, quality matters |
| Q6_K | 6-bit | Excellent | High | 48 GB+ VRAM or multi-GPU |
| Q8_0 | 8-bit | Near-FP16 | Highest | Evaluation/research, maximum quality |
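
To turn the table into rough numbers, multiply the parameter count by the bits per weight. The sketch below uses the nominal bit widths from the table as an assumption; real GGUF files are somewhat larger because the quantization formats also store scaling data and keep some tensors at higher precision, and the context cache needs VRAM on top. Treat the result as a lower bound, not a measurement.

```python
# Nominal bits per weight, taken from the table above (an approximation)
NOMINAL_BITS = {"Q2_K": 2, "Q3_K_M": 3, "Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}


def estimate_weights_gib(params_billions: float, quant: str) -> float:
    """Lower-bound size of the weights in GiB: parameters x nominal bits / 8."""
    bits = NOMINAL_BITS[quant]
    return params_billions * 1e9 * bits / 8 / 2**30


if __name__ == "__main__":
    for quant in NOMINAL_BITS:
        small = estimate_weights_gib(8, quant)
        large = estimate_weights_gib(32, quant)
        print(f"{quant:7s} 8B >= {small:5.1f} GiB   32B >= {large:5.1f} GiB")
```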

Method 1: Direct Download via llama-server (Simplest)

./llama.cpp/build/bin/llama-server \
  -hf unsloth/Qwen3-8B-GGUF:Q4_K_M \
  --port 8012 \
  --host 0.0.0.0

Method 2: Manual Download via huggingface_hub

# Install Python and create an environment
apt install -y python3-full python3-pip pipenv
 
mkdir -p ~/hf/scripts && cd ~/hf/scripts
pipenv shell
 
# Install Hugging Face tools
pip install huggingface_hub hf_transfer

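Create a download script for each model you need. Once a script exists, run it from inside the pipenv shell with python, for example python ~/hf/scripts/dl.py.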

cat > ~/hf/scripts/dl.py << 'DLEOF'
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"  # Set to "1" for faster downloads if your connection is stable
from huggingface_hub import snapshot_download
 
snapshot_download(
    repo_id="unsloth/Qwen3-8B-GGUF",
    local_dir="/root/models/Qwen3-8B-GGUF",
    allow_patterns=["*Q4_K_M*"],
)
DLEOF
 
# Qwen3 32B
cat > ~/hf/scripts/dl_qwen3_32b.py << 'DLEOF'
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
from huggingface_hub import snapshot_download
 
snapshot_download(
    repo_id="unsloth/Qwen3-32B-GGUF",
    local_dir="/root/models/Qwen3-32B-GGUF",
    allow_patterns=["*Q4_K_M*"],
)
DLEOF
 
# DeepSeek R1 Distill Qwen 32B
cat > ~/hf/scripts/dl_deepseek_r1_32b.py << 'DLEOF'
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
from huggingface_hub import snapshot_download
 
snapshot_download(
    repo_id="unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF",
    local_dir="/root/models/DeepSeek-R1-Distill-Qwen-32B-GGUF",
    allow_patterns=["*Q4_K_M*"],
)
DLEOF
 
# DeepSeek R1 0528 Qwen3 8B (downloads the unquantized BF16 file)
cat > ~/hf/scripts/dl_deepseek_r1_qwen3_8b.py << 'DLEOF'
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
from huggingface_hub import snapshot_download
 
snapshot_download(
    repo_id="unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF",
    local_dir="/root/models/DeepSeek-R1-0528-Qwen3-8B-GGUF",
    allow_patterns=["*DeepSeek-R1-0528-Qwen3-8B-BF16.gguf*", "config.json"],
)
DLEOF
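
Because the scripts above differ only in the repository and the filename pattern, they could be folded into one parameterized helper. The following is a sketch, not the author's script: the function names `download_kwargs` and `download` are hypothetical, and /root/models as the default target is an assumption. It relies only on `snapshot_download` with the same `repo_id`, `local_dir`, and `allow_patterns` arguments used above.

```python
import os

# Must be set before huggingface_hub is imported; "1" requires the hf_transfer package
os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "0")


def download_kwargs(repo_id: str, pattern: str = "*Q4_K_M*", models_root: str = "/root/models") -> dict:
    """Build the snapshot_download arguments for one repo and one filename pattern."""
    repo_name = repo_id.split("/")[-1]  # e.g. "Qwen3-8B-GGUF"
    return {
        "repo_id": repo_id,
        "local_dir": os.path.join(models_root, repo_name),
        "allow_patterns": [pattern],
    }


def download(repo_id: str, pattern: str = "*Q4_K_M*", models_root: str = "/root/models") -> str:
    """Download the matching files and return the local directory path."""
    # Imported here so the environment variable above is already in place
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(repo_id, pattern, models_root))
```

Calling `download("unsloth/Qwen3-8B-GGUF")` would then fetch the Q4_K_M file into /root/models/Qwen3-8B-GGUF, and a different quantization only needs a different `pattern`.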

Best practice: Store all models in a dedicated directory:

mkdir -p /root/models

Run llama-server

Basic Launch

./llama.cpp/build/bin/llama-server \
  --model /root/models/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf \
  --port 8012 \
  --host 0.0.0.0 \
  --api-key your-secret-key \
  --n-gpu-layers 99 \
  --ctx-size 131072 \
  --threads 12
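
The --ctx-size value has a direct memory cost, because the server keeps a key/value cache entry for every token of context. The sketch below estimates that cost for a 16-bit cache. The architecture numbers used for Qwen3-8B (36 layers, 8 key/value heads, head dimension 128) are assumptions for illustration; check them against the model's own configuration before relying on the result.

```python
def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 2) -> float:
    """Size in GiB of the key and value caches for one sequence of n_ctx tokens."""
    # Factor 2 covers keys plus values; bytes_per_elem=2 corresponds to a 16-bit cache
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_ctx * per_token / 2**30


if __name__ == "__main__":
    # Assumed Qwen3-8B shape: 36 layers, 8 KV heads, head dimension 128
    for n_ctx in (8192, 32768, 131072):
        print(f"ctx {n_ctx:6d}: {kv_cache_gib(n_ctx, 36, 8, 128):5.2f} GiB")
```

Under these assumptions the full 131072-token context needs about 18 GiB for the cache alone, on top of the model weights, so a smaller --ctx-size may be the better choice on a single 24 GB card.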

Verify the Server

# Run on the server itself; from another machine, replace localhost with the server's address
curl http://localhost:8012/v1/models
 
# Test a completion
curl http://localhost:8012/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{
    "model": "Qwen3-8B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
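
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the function names are hypothetical, and it assumes the response follows the usual OpenAI chat-completion shape with the reply under `choices[0].message.content`.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions without sending it."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(base_url: str, api_key: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(base_url, api_key, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("http://localhost:8012", "your-secret-key", "Qwen3-8B", "Hello!"))` should print the model's answer.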

Run as a systemd Service

cat > /etc/systemd/system/llama-server.service << 'EOF'
[Unit]
Description=llama.cpp Inference Server
After=network.target
 
[Service]
Type=simple
ExecStart=/root/llama.cpp/build/bin/llama-server \
  --model /root/models/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf \
  --port 8012 \
  --host 0.0.0.0 \
  --api-key your-secret-key \
  --n-gpu-layers 99 \
  --ctx-size 131072 \
  --threads 12
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
# Exposes GPUs 0 and 1 to the server; adjust to match your machine
Environment="CUDA_VISIBLE_DEVICES=0,1"
 
[Install]
WantedBy=multi-user.target
EOF
 
systemctl daemon-reload
systemctl enable llama-server
systemctl start llama-server

Manage the service:

# Check status
systemctl status llama-server
 
# View logs
journalctl -u llama-server -f
 
# Restart after config changes
systemctl restart llama-server
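
After a start or restart the service may report as active while the model is still loading. One way to wait for it is to poll the server's /health endpoint until it answers with HTTP 200. The sketch below assumes the port from the unit file above; the function names and the retry settings are illustrative choices, not part of llama.cpp.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True when GET /health answers 200; any error counts as not ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx replies such as 503
        return False


def wait_until_ready(check, attempts: int = 60, delay: float = 5.0, sleep=time.sleep) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For example, `wait_until_ready(lambda: is_healthy("http://localhost:8012"))` returns True once the server is ready, or False after roughly five minutes with the defaults shown.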

Next

Continue to Open WebUI Integration to connect the server to Open WebUI and tune performance.

For a deeper breakdown of commonly tuned llama-server flags, continue to llama-server Parameters Explained.
