Limited VRAM Playbook (Large Models)
Practical tactics to run very large models on old or low-VRAM GPUs: smarter CPU/GPU placement, mmap control, KV cache compression, and long-run stability.
Published May 10, 2026
Can you run a very large model on old hardware with only 6 GB VRAM and still keep it usable? Yes.
This page captures a practical workflow for making that happen on llama.cpp, including what helped, what did not, and why.
Test Rig (Worst-Case Baseline)
- GPU: GTX 1060 6 GB (PCIe Gen 3)
- CPU: i3-8100 (4 cores, no HT)
- RAM: 24 GB DDR4
The point is not that this is an ideal rig. The point is that this is a floor. Newer hardware should perform better with the same settings.
The Baseline That Feels Slow
A naive split uses only --n-gpu-layers and leaves the rest on CPU/RAM.
llama-server \
--model /root/models/<model>.gguf \
--n-gpu-layers 20
This can work, but throughput often stays low because too much per-token work crosses PCIe.
Five Practical Levers
1) Move MoE Experts to CPU Intentionally
For MoE models, --n-cpu-moe is the key lever. Instead of a generic layer split, it keeps the expert weights of the first N layers on the CPU while attention and other non-expert compute stays hot on the GPU.
--n-cpu-moe 41
Then tune downward (for example 35 or 36) to move some experts back to GPU if VRAM allows.
2) Disable mmap for Predictable Inference
Use --no-mmap so model data is loaded into RAM upfront instead of demand-paged from disk mid-inference.
--no-mmap
This usually reduces stutter and improves consistency on long sessions.
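Because --no-mmap loads the whole file upfront, it helps to confirm the model fits in available RAM before launching. The following is a minimal sketch, not part of llama.cpp: the function names and the 2 GiB default headroom are assumptions chosen for illustration, and the check is conservative because weights offloaded to the GPU do not stay resident in RAM.

```shell
#!/bin/sh
# Sketch: decide whether a model file can be fully loaded with --no-mmap.
# fits_in_ram MODEL_BYTES AVAILABLE_KIB [HEADROOM_KIB]
# Exits 0 when the model plus headroom fits in the available memory.
fits_in_ram() {
    model_kib=$(( ($1 + 1023) / 1024 ))   # round bytes up to KiB
    headroom_kib=${3:-2097152}            # assumed default: 2 GiB spare
    [ $(( model_kib + headroom_kib )) -le "$2" ]
}

# Available memory as reported by the Linux kernel, in KiB.
mem_available_kib() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# Example (Linux, GNU stat), left commented so sourcing has no side effects:
#   fits_in_ram "$(stat -c %s "$MODEL_PATH")" "$(mem_available_kib)" \
#       && echo "safe to try --no-mmap"
```

If the check fails, either free memory first or pick a smaller quantization before disabling mmap.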
3) Re-balance Experts vs Context
If VRAM headroom exists, reduce --n-cpu-moe (for example 41 -> 35) to keep more work on GPU.
Tradeoff: more VRAM used by model weights leaves less room for KV cache/context.
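The re-balancing step is easiest as a small sweep. This sketch only prints candidate commands, one per --n-cpu-moe value, so each can be launched and checked by hand; the helper name moe_sweep is hypothetical, and --n-gpu-layers 99 mirrors the value used in the Docker pattern on this page.

```shell
#!/bin/sh
# Sketch: print one candidate llama-server command per --n-cpu-moe value,
# stepping downward from START to STOP (inclusive).
# Usage: moe_sweep MODEL START STOP
moe_sweep() {
    model=$1
    n=$2
    stop=$3
    while [ "$n" -ge "$stop" ]; do
        printf 'llama-server --model %s --n-gpu-layers 99 --n-cpu-moe %d --no-mmap\n' \
            "$model" "$n"
        n=$(( n - 1 ))
    done
}

# Example: moe_sweep /root/models/model.gguf 41 35
```

Run the candidates in order while watching VRAM use (for example with nvidia-smi), and keep the lowest value that loads cleanly and still leaves room for the KV cache at your target context.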
4) Compress KV Cache Aggressively (Carefully)
KV cache memory grows linearly with context length. Lower KV precision can substantially increase the usable context within the same VRAM.
--cache-type-k q4_0 --cache-type-v q4_0
Use this with context tuning:
--ctx-size 128000
# then test
--ctx-size 256000
If OOM appears, move one step back in GPU allocation (for MoE, this usually means raising --n-cpu-moe by 1).
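The linear relationship can be estimated before launching anything. This is a rough sketch: the function names are hypothetical, and the model dimensions in the example (32 attention layers, 8 KV heads, head dimension 128) are assumptions for illustration, not values from any model discussed here. The block sizes follow ggml's layouts: f16 stores 2 bytes per element, q8_0 stores 34 bytes per 32 elements, and q4_0 stores 18 bytes per 32 elements. In hybrid architectures only the attention layers hold a KV cache, so count those layers rather than the total.

```shell
#!/bin/sh
# Bytes used by one block of 32 cache elements for a given cache type.
block_bytes() {
    case $1 in
        f16)  echo 64 ;;
        q8_0) echo 34 ;;
        q4_0) echo 18 ;;
        *)    echo "unknown cache type: $1" >&2; return 1 ;;
    esac
}

# kv_cache_mib ATTN_LAYERS KV_HEADS HEAD_DIM CTX TYPE_K TYPE_V
# Prints the estimated KV cache size in MiB (integer, rounded down).
# Assumes HEAD_DIM is a multiple of 32.
kv_cache_mib() {
    layers=$1
    kv_heads=$2
    head_dim=$3
    ctx=$4
    bk=$(block_bytes "$5") || return 1
    bv=$(block_bytes "$6") || return 1
    elems=$(( layers * kv_heads * head_dim * ctx ))   # elements in K (same for V)
    echo $(( elems / 32 * (bk + bv) / 1048576 ))
}

# Example with the assumed dimensions at --ctx-size 128000:
#   kv_cache_mib 32 8 128 128000 f16 f16     prints 16000
#   kv_cache_mib 32 8 128 128000 q4_0 q4_0   prints 4500
```

Doubling the context doubles either figure, which is why dropping cache precision is what makes the larger --ctx-size values reachable at all.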
5) Lock RAM for Long-Run Stability
A setup that benchmarks well can still degrade after hours if the OS pages model data out.
Use all three layers together:
- container permission to lock memory (--ulimit memlock=-1:-1)
- Docker IPC_LOCK capability (--cap-add IPC_LOCK)
- llama.cpp --mlock
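If the locked-memory limit is too low, --mlock cannot pin the model, so it is worth checking the limit from inside the container before a long run. This is a minimal sketch: memlock_ok is a hypothetical helper, and it assumes the shell reports the limit in KiB, as bash does for ulimit -l.

```shell
#!/bin/sh
# memlock_ok LIMIT NEED_KIB
# LIMIT is the output of `ulimit -l` (a KiB count or the word "unlimited").
# Exits 0 when at least NEED_KIB of memory may be locked.
memlock_ok() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ]
}

# Example, run inside the container (20971520 KiB = 20 GiB, an assumed
# model size), left commented so sourcing has no side effects:
#   memlock_ok "$(ulimit -l)" 20971520 || echo "raise the memlock limit"
```

With --ulimit memlock=-1:-1 the limit reads as unlimited and the check passes regardless of model size.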
One Docker Command Pattern
This is the combined pattern (adjust model path and values):
docker run --rm \
--gpus all \
--cap-add IPC_LOCK \
--ulimit memlock=-1:-1 \
-p 8012:8012 \
-v /root/models:/models \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/<model>.gguf \
--host 0.0.0.0 \
--port 8012 \
--n-gpu-layers 99 \
--n-cpu-moe 36 \
--no-mmap \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--ctx-size 256000 \
--mlock
Dense vs MoE (Important)
The MoE-specific trick is --n-cpu-moe.
For dense models, this flag is irrelevant because there are no expert blocks. Dense tuning on limited VRAM relies more on:
- --n-gpu-layers
- --ctx-size
- --cache-type-k and --cache-type-v
- quantization choice (for example Q4_K_M vs Q5_K_M)
- avoiding paging (--no-mmap, --mlock)
What Did Not Help Here
Speculative decoding can underperform on some MoE + SSM architectures due to verification overhead and memory traffic patterns, even when draft acceptance looks decent.
Treat speculative decoding as model-architecture dependent, not a universal speedup.
Practical Outcome Target
The goal is not just "it runs". The goal is interactive speed with enough context to be useful in real workflows.
On low-end hardware, these levers can move a setup from barely usable to stable, conversational throughput.