llama-server Parameters Explained

A practical reference for the most-used llama.cpp server parameters, grouped by networking, performance, memory, context, sampling, and operations.

Published May 10, 2026

llama-server Parameters Explained

This page is a practical map of the options that matter most in real deployments.

It focuses on the parameters you will tune repeatedly in homelab and production-like setups, not every single flag in --help.

Upstream reference used for this summary:

Mental Model: Tune In This Order

  1. Make it reachable and secure (--host, --port, --api-key)
  2. Make it fit memory (--n-gpu-layers, --ctx-size, cache types, --fit)
  3. Make it fast enough (--threads, --threads-batch, --flash-attn, batching)
  4. Make it stable for long runs (--mlock, --no-mmap, metrics, timeouts)
  5. Then tune output behavior (temperature and sampling)

1) Core Startup And Networking

ParameterWhat it doesTypical use
-m, --modelPath to GGUF model fileRequired in single-model mode
--hostBind address (default 127.0.0.1)Use 0.0.0.0 for LAN access
--portHTTP port (default 8080)Use a fixed internal port like 8012
--api-keyEnables API key authAlways enable when exposed beyond localhost
-a, --aliasFriendly model ID in API responsesClean model names in Open WebUI
--api-prefixPrefixes all routesUseful behind reverse proxies
-to, --timeoutRead/write timeout (seconds)Raise for slow clients or long generations

Minimal single-model command

/root/llama.cpp/build/bin/llama-server \
  --model /root/models/Qwen3-8B-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8012 \
  --api-key your-secret-key

2) Memory Fit And Device Placement

These flags decide whether the server starts reliably and whether context fits.

ParameterWhat it doesRule of thumb
-ngl, --n-gpu-layersNumber of layers in VRAMIncrease until VRAM is near full but stable
-c, --ctx-sizeContext window sizeStart modest, then scale after stability
-ctk, --cache-type-kKV cache dtype for keysLower precision for larger context
-ctv, --cache-type-vKV cache dtype for valuesPair with cache-type-k
-fit, --fitAuto-adjust unset args to fit memoryKeep enabled unless hand-tuning everything
-fitt, --fit-targetVRAM headroom target per GPUKeep safety margin to avoid OOM edges
-dev, --deviceWhich devices to offload toExplicit multi-GPU placement
-sm, --split-modeMulti-GPU split behaviorDefault layer is usually easiest
-ts, --tensor-splitPer-GPU offload ratiosUse when GPUs are mismatched
-mg, --main-gpuMain GPU indexKeep deterministic assignment

MoE-specific placement

ParameterWhat it doesWhen to use
-cmoe, --cpu-moeKeep all MoE weights on CPUVery low VRAM environments
-ncmoe, --n-cpu-moeKeep first N MoE layers on CPUFine-grained VRAM balancing

3) Throughput And Concurrency

ParameterWhat it doesPractical guidance
-t, --threadsCPU threads for token generationStart near physical core count
-tb, --threads-batchThreads for prompt/batch workOften >= --threads helps prefills
-b, --batch-sizeLogical max batch sizeIncrease carefully; watch latency
-ub, --ubatch-sizePhysical micro-batch sizeTune with VRAM and backend behavior
-fa, --flash-attnFlash attention on/off/autoKeep auto or on when supported
-np, --parallelNumber of server slotsRaise for multi-user concurrency
-cb, --cont-batchingContinuous/dynamic batchingKeep enabled for shared endpoints
--threads-httpHTTP worker threadsRaise only if HTTP side is bottleneck

Stability note for concurrency

Higher --parallel improves throughput under load but can increase memory pressure and tail latency. Increase gradually and monitor /metrics.

4) Long-Run Reliability Flags

ParameterWhat it doesWhy it matters
--mlockLock model in RAMPrevents paging-related jitter
--no-mmapDisable memory-mapped model fileSlower startup, often fewer runtime stalls
--sleep-idle-secondsUnload model when idleSaves RAM/VRAM in bursty usage
--warmupEmpty warmup run at startupReduces first-request surprise latency
--metricsEnable Prometheus endpointRequired for serious tuning/ops
--slotsSlot monitoring endpointUseful for queue and slot visibility
--cache-promptPrefix reuse in KV cacheBig win for repeated prompt prefixes
--cache-reuseReuse chunks via KV shiftingHelps with long, similar prompts

Operational baseline

--mlock --no-mmap --metrics --cache-prompt

Then add --sleep-idle-seconds only if cold-start reload time is acceptable.

5) Output Behavior (Sampling)

These change response style and consistency, not hardware utilization.

ParameterWhat it doesTypical starting point
--tempRandomness0.6 to 0.8
--top-pNucleus sampling0.9 to 0.95
--top-kCandidate cap20 to 50
--min-pRelative probability floor0.02 to 0.08
--repeat-penaltyPenalize repetitions1.05 to 1.15
--repeat-last-nTokens checked for repetition64 is common
--seedDeterminism controlSet fixed value for reproducible tests
--mirostat + relatedAlternative adaptive samplerUse only when deliberately testing it

6) Chat Template, Tool Use, And API Compatibility

ParameterWhat it doesUse case
--jinjaEnables Jinja chat template engineRecommended for modern chat models
--chat-templateOverride templateFixes mismatched defaults
--chat-template-fileCustom template fileAdvanced custom prompting flows
--reasoning-formatParses reasoning output formatFor models with thought tags
--embeddingEmbedding-focused server modeDedicated embedding models
--rerankEnables reranking endpointRetrieval pipelines

7) Router Mode Parameters (Multi-Model)

ParameterWhat it doesTypical use
--models-dirDiscover models from directoryEasiest multi-model bootstrap
--models-presetINI preset file per modelProduction-style per-model tuning
--models-maxMax loaded models at onceControls memory pressure
--models-autoloadAuto-load on first requestKeep enabled for convenience
--no-models-autoloadManual load via APITighter operational control

If you run one endpoint for many models, these matter more than single-model --model flags.

8) Three Ready-Made Profiles

A. Single-user quality profile

--threads 8 --threads-batch 8 --ctx-size 32768 --temp 0.7 --top-p 0.95

B. Shared homelab endpoint profile

--parallel 2 --cont-batching --cache-prompt --metrics --timeout 600

C. Constrained VRAM profile

--n-gpu-layers 35 --ctx-size 131072 --cache-type-k q4_0 --cache-type-v q4_0 --mlock --no-mmap

9) How To Tune Safely (Quick Loop)

  1. Fix one benchmark prompt and one chat prompt.
  2. Record baseline from /v1/chat/completions timings and /metrics.
  3. Change one flag family at a time: memory, then throughput, then sampling.
  4. Keep notes on throughput, first-token latency, and OOM/reload behavior.
  5. Keep the best-known stable profile in your systemd service.

Comments

Sign in with GitHub to leave a comment or reaction.