Update And Maintenance

Homelab failures often arrive during maintenance windows, not during the quiet weeks between them.

That is why update work needs its own discipline.

This page is the operating rhythm behind a Proxmox host that runs real services: host first, containers second, applications after that, and GPU changes treated as their own category instead of casually folded into everything else.

If alerts are not set up yet, do Email Notifications first. If the host still needs initial bring-up work, go back to Homelab Installation.

Golden Rules

Always snapshot before updating.
Update the host first and containers second.
Dry-run before committing.
Do not update GPU components and the host kernel blindly in one motion.
Keep NVIDIA driver versions aligned between host and GPU containers.
Update during low-usage windows.
Verify each stage before moving to the next one.

Just as important is knowing what not to do.

Mistake	Why It Hurts
`apt upgrade` instead of `apt full-upgrade`	Proxmox expects the full dependency resolution path.
Updating containers before the host	Guests can end up assuming a newer base than the running host provides.
Skipping snapshots	Rollback turns into restore-or-rebuild work instead of a quick unwind.
Updating NVIDIA versions out of sync	GPU workloads fail in frustrating ways that look bigger than they are.
Running `apt autoremove` without reading it	It is occasionally too enthusiastic.

Recommended Update Order

1. Proxmox VE Host
2. Non-GPU Containers
3. GPU Containers
4. Application Services
5. NVIDIA Driver Changes (only when actually needed)

That order is conservative on purpose.
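
The ordering above, combined with the rule to verify each stage before moving on, can be sketched as a small runner that stops at the first failing stage. This is only an illustration: `run_stages` and the stage names in the usage note are hypothetical, not part of the procedure described on this page.

```shell
#!/usr/bin/env bash
# Sketch: run maintenance stages in order and stop at the first failure.
# run_stages is an illustrative helper name, not an existing tool.

run_stages() {
  local stage
  for stage in "$@"; do
    echo "=== Stage: $stage ==="
    # Each stage is a shell function or command that returns non-zero on failure.
    if ! "$stage"; then
      echo "Stage '$stage' failed; later stages were not started." >&2
      return 1
    fi
  done
  echo "All stages completed."
}
```

Each stage would be a shell function that performs its step and ends with its own verification, for example `run_stages update_host update_non_gpu_cts update_gpu_cts update_apps`, where those four functions are placeholders you would define yourself.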

Major Upgrades Are Their Own Window

Routine updates belong here. Major upgrades deserve a dedicated runbook.

The PVE 9.2 upgrade is the example that made the boundary obvious: PBS backups came first, container package updates came second, the host moved to kernel 7.0 third, and the reboot exposed a nova_core boot hang that needed its own RCA.

Use PVE 9.2 Upgrade Runbook for that full sequence, then read Kernel 7.0 Boot Hang RCA for the side effect and fix.

Pre-Update Checklist

Run this on the host before you touch anything else:

# === PRE-UPDATE CHECKLIST ===
echo "=== 1. Current Proxmox Version ==="
pveversion -v
 
echo -e "\n=== 2. ZFS Pool Health ==="
zpool status -x
# Should say: "all pools are healthy"
 
echo -e "\n=== 3. Disk Space ==="
df -h / /var
zfs list -o name,used,avail rpool
 
echo -e "\n=== 4. Container Status ==="
pct list
 
echo -e "\n=== 5. GPU Status ==="
nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv,noheader
# Check: GPU utilization should be 0% (no active inference)
 
echo -e "\n=== 6. NVIDIA DKMS Status ==="
dkms status
 
echo -e "\n=== 7. Running Services ==="
systemctl is-active pveproxy pvedaemon nvidia-persistenced
 
echo -e "\n=== 8. Last Backup Date ==="
{ ls -lt /backup/dump/ 2>/dev/null || echo "No backups found in /backup/dump/"; } | head -5
 
echo -e "\n=== PRE-UPDATE CHECK COMPLETE ==="

If ZFS is not healthy, disk space is tight, or the GPU is busy, stop there and fix the condition first.
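
Those three stop conditions can be folded into a single go/no-go check. The sketch below is one possible shape, not the page's own tooling: the 80% disk threshold, the "any utilization above 0% means busy" rule, and the function names are all assumptions, and only the first GPU is inspected.

```shell
#!/usr/bin/env bash
# Sketch: evaluate the pre-update stop conditions in one place.
# Assumed thresholds: root filesystem above 80% used is "tight",
# GPU utilization above 0% is "busy".

preflight_ok() {
  local zfs_msg root_use_pct gpu_util_pct rc
  zfs_msg=$1
  root_use_pct=$2
  gpu_util_pct=$3
  rc=0
  if [ "$zfs_msg" != "all pools are healthy" ]; then
    echo "STOP: ZFS reports a problem: $zfs_msg" >&2
    rc=1
  fi
  if [ "$root_use_pct" -gt 80 ]; then
    echo "STOP: root filesystem is ${root_use_pct}% full" >&2
    rc=1
  fi
  if [ "$gpu_util_pct" -gt 0 ]; then
    echo "STOP: GPU is busy (${gpu_util_pct}% utilization)" >&2
    rc=1
  fi
  return "$rc"
}

# Gather live values on the host and evaluate them (first GPU only).
preflight_host() {
  local zfs_msg root_pct gpu_util
  zfs_msg=$(zpool status -x)
  root_pct=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
  gpu_util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n 1 | tr -dc '0-9')
  preflight_ok "$zfs_msg" "$root_pct" "$gpu_util"
}
```

Running `preflight_host && echo "Safe to continue"` on the host would then gate the rest of the window on a clean result.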

Snapshot Before You Start

# Generate date stamp
DATE=$(date +%Y-%m-%d)
 
# Snapshot all containers
for CTID in 100 102 103 105 200 201; do
  echo "Snapshotting CT $CTID..."
  pct snapshot $CTID pre-update-$DATE --description "Pre-update snapshot $DATE"
done
 
echo "All snapshots created."
pct listsnapshot 100  # Verify one of them

For high-risk work such as kernel or NVIDIA changes, take an extra backup of the GPU containers too.
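
A minimal sketch of that extra backup, assuming CT 100 and CT 102 are the GPU containers (as elsewhere on this page) and that `/backup/dump` is the dump directory inspected in the pre-update checklist. The `run` and `backup_gpu_cts` helpers and the `DRY_RUN` switch are hypothetical conveniences; adjust the compression and target to your own storage.

```shell
#!/usr/bin/env bash
# Sketch: full vzdump backup of the GPU containers before high-risk work.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

backup_gpu_cts() {
  local ctid
  for ctid in 100 102; do
    # One container per vzdump call; stop if any backup fails.
    run vzdump "$ctid" --mode snapshot --compress zstd --dumpdir /backup/dump || return 1
  done
}
```

`DRY_RUN=1 backup_gpu_cts` prints the two commands so they can be reviewed before the real run.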

Step 1: Update The Proxmox Host

Dry-Run First

# Refresh package lists
apt update
 
# Preview what will be upgraded (DRY RUN — no changes made)
apt full-upgrade -s

Read that output like it matters, because it does.
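
When reading the simulated output, planned removals deserve the most attention. The sketch below assumes the usual simulation format, in which removal lines begin with `Remv`; `check_removals` and its return codes are hypothetical, and it is a reading aid rather than a replacement for looking at the full output.

```shell
#!/usr/bin/env bash
# Sketch: scan "apt full-upgrade -s" output (on stdin) for planned removals.
# Return codes (assumed convention): 0 = none, 1 = removals to review,
# 2 = proxmox-ve itself would be removed.

check_removals() {
  local removals
  removals=$(grep '^Remv ' | awk '{print $2}')
  if [ -z "$removals" ]; then
    echo "No packages would be removed."
    return 0
  fi
  echo "Packages that would be removed:"
  echo "$removals"
  if echo "$removals" | grep -qx 'proxmox-ve'; then
    echo "STOP: the upgrade wants to remove proxmox-ve." >&2
    return 2
  fi
  return 1
}

# Usage on the host:
#   apt full-upgrade -s | check_removals
```

A return code of 2 corresponds to the "Proxmox Wants To Remove `proxmox-ve`" failure mode described later on this page.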

Apply The Update

# Apply all updates
apt full-upgrade -y
 
# Clean up (no -y here: read the removal list before confirming)
apt autoremove --purge
apt autoclean

Reboot If The Kernel Changed

# Check if a reboot is required
[ -f /var/run/reboot-required ] && echo "REBOOT REQUIRED" || echo "No reboot needed"
 
# If reboot required:
reboot
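
The flag file checked above is not guaranteed to exist on every Debian-based host, since only some packages maintain it. As a cross-check, the running kernel can be compared with the newest kernel image in `/boot`. This is a sketch under assumptions: it expects images named `/boot/vmlinuz-<version>`, the function names are hypothetical, and a pinned kernel or custom boot setup can make the comparison misleading.

```shell
#!/usr/bin/env bash
# Sketch: compare the running kernel with the newest installed kernel image.

newest_installed_kernel() {
  local boot_dir
  boot_dir=${1:-/boot}
  # Strip the path and "vmlinuz-" prefix, then version-sort and keep the highest.
  ls "$boot_dir"/vmlinuz-* 2>/dev/null \
    | sed 's|.*/vmlinuz-||' \
    | sort -V \
    | tail -n 1
}

kernel_reboot_needed() {
  local running newest
  running=$1
  newest=$2
  [ -n "$newest" ] && [ "$running" != "$newest" ]
}

# Usage on the host:
#   if kernel_reboot_needed "$(uname -r)" "$(newest_installed_kernel)"; then
#     echo "Running kernel differs from newest installed kernel; plan a reboot."
#   fi
```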

Verify After Reboot

# Verify Proxmox is running
pveversion -v
systemctl status pveproxy pvedaemon
 
# Verify NVIDIA driver survived the kernel update
nvidia-smi
dkms status
# Expected: nvidia/<version>, <kernel-version>, x86_64: installed
 
# Verify all containers auto-started
pct list

If DKMS Did Not Rebuild

# Check DKMS status
dkms status
 
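# Force rebuild for current kernel
# Split on "/", "," and spaces so both "nvidia/<version>, ..." and "nvidia, <version>, ..." yield the version
NVIDIA_VERSION=$(dkms status | grep nvidia | head -1 | awk -F'[/, ]+' '{print $2}')
KERNEL_VERSION=$(uname -r)
dkms install "nvidia/$NVIDIA_VERSION" -k "$KERNEL_VERSION"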
 
# Reload modules
modprobe nvidia
modprobe nvidia_uvm
 
# Verify
nvidia-smi

Step 2: Update LXC Container OS Packages

Non-GPU containers go first. GPU containers come later after the host has already proven it is healthy.

# Non-GPU containers first (low risk)
for CTID in 103 105 200 201; do
  echo "=========================================="
  echo "Updating CT $CTID ($(pct config $CTID | grep hostname | awk '{print $2}'))..."
  echo "=========================================="
  pct exec $CTID -- bash -c "apt update && apt full-upgrade -y && apt autoremove --purge -y && apt autoclean"
  echo "CT $CTID done."
  echo ""
done
 
# GPU containers last
for CTID in 100 102; do
  echo "=========================================="
  echo "Updating CT $CTID ($(pct config $CTID | grep hostname | awk '{print $2}'))..."
  echo "=========================================="
  pct exec $CTID -- bash -c "apt update && apt full-upgrade -y && apt autoremove --purge -y && apt autoclean"
  echo "CT $CTID done."
  echo ""
done
 
echo "All container OS packages updated."

Verify the result:

# Quick health check — all containers should be running
pct list
 
# Check GPU access in GPU containers
pct exec 100 -- nvidia-smi --query-gpu=driver_version --format=csv,noheader
pct exec 102 -- nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Both should show the same driver version as the host

Step 3: Update Application Services

This is where the lab starts to feel cross-domain.

Pi-hole may belong to networking conceptually. GPU inference may belong to GPU & AI conceptually. But if the question today is "what do I update on this Proxmox host and in what order?" then they all belong here for the duration of the maintenance window.

Ollama (CT 100)

# Check current version
pct exec 100 -- ollama --version
 
# Update Ollama
pct exec 100 -- bash -c "curl -fsSL https://ollama.com/install.sh | sh"
 
# Restart the service
pct exec 100 -- systemctl restart ollama
 
# Verify
pct exec 100 -- ollama --version
pct exec 100 -- systemctl status ollama --no-pager
pct exec 100 -- nvidia-smi

pct exec 100 -- curl -s http://localhost:58008/api/tags | head -20
# Should list your downloaded models

llama.cpp (CT 102)

# Check current version (commit hash)
pct exec 102 -- bash -c "cd ~/llama.cpp && git log --oneline -1"
 
# Stop the service before rebuilding
pct exec 102 -- systemctl stop llama-server
 
# Pull latest changes and rebuild
pct exec 102 -- bash -c "cd ~/llama.cpp && git pull && cmake -S . -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON && cmake --build build --config Release -j\$(nproc)"
 
# Restart the service
pct exec 102 -- systemctl restart llama-server
 
# Verify
pct exec 102 -- systemctl status llama-server --no-pager
pct exec 102 -- bash -c "cd ~/llama.cpp && git log --oneline -1"

# Check the server is responsive
curl -s http://192.168.50.45:8012/health
# Should return: {"status":"ok"}
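
A freshly restarted server may not answer its health endpoint immediately, so a single `curl` right after the restart can fail even when nothing is wrong. The sketch below retries a check a fixed number of times; `wait_for` is a hypothetical helper, and the attempt count and delay in the usage line are assumptions to tune for your model load time.

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or the attempts run out.
# Arguments: <attempts> <delay-seconds> <command...>

wait_for() {
  local attempts delay i
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "Ready after $i attempt(s)."
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Not ready after $attempts attempts." >&2
  return 1
}

# Usage on the host (up to 30 tries, 2 seconds apart; curl -f fails on HTTP errors):
#   wait_for 30 2 curl -fs http://192.168.50.45:8012/health
```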

Open WebUI (CT 103)

Use the same `uv tool install --force` invocation as the Community-Scripts installer. The shorter upgrade shortcut skips the install constraint used by the helper script, so keep it out of routine maintenance.

For backup, rollback, and database-migration notes, use Open WebUI Standalone Frontend On Proxmox.

# Check current version
pct exec 103 -- /root/.local/bin/uv tool list | grep open-webui
 
# Snapshot before updating
pct snapshot 103 pre-openwebui-update-$(date +%Y%m%d) --description "Before OpenWebUI update"
 
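# Enter the container (the remaining commands run inside CT 103; type exit when finished)
pct enter 103
 
# Stop the service
systemctl stop open-webui
 
# Update Open WebUI using the same method as the installer
# (the package spec is quoted so the shell does not treat [all] as a glob)
uv tool install --force --python 3.12 --constraint <(echo "numba>=0.60") "open-webui[all]"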
 
# Restart the service
systemctl restart open-webui
 
# Verify running version and service status
journalctl -u open-webui -n 30 --no-pager | grep "v0\."
systemctl status open-webui --no-pager

# Check the web interface is responding
curl -s -o /dev/null -w "%{http_code}" http://192.168.50.30:42
# Should return: 200

Pi-hole (CT 105)

# Check current version
pct exec 105 -- pihole version
 
# Update Pi-hole (core + web + FTL)
pct exec 105 -- pihole -up
 
# Verify
pct exec 105 -- pihole version
pct exec 105 -- pihole status

# Test DNS resolution through Pi-hole
dig @192.168.50.11 google.com +short
# Should return IP addresses
 
# Check the admin interface
curl -s -o /dev/null -w "%{http_code}" http://192.168.50.11/admin/
# Should return: 200 or 301

After updating the secondary Pi-hole, resynchronise it:

pct exec 105 -- gravity-sync pull

Step 4: NVIDIA Driver Updates

Do this only when you actually need a new driver version.

Preparation

# Record current driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# e.g., 580.126.09
 
# Stop ALL GPU-dependent services
pct exec 100 -- systemctl stop ollama
pct exec 102 -- systemctl stop llama-server
 
# Verify GPU is idle
nvidia-smi
# Should show "No running processes found"

Update Host Driver

# Download new driver .run file (replace version as needed)
cd /root
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/NEW_VERSION/NVIDIA-Linux-x86_64-NEW_VERSION.run
chmod +x NVIDIA-Linux-x86_64-NEW_VERSION.run
 
# Install with DKMS support
./NVIDIA-Linux-x86_64-NEW_VERSION.run --dkms --no-opengl-files --no-questions
 
# Verify
nvidia-smi
dkms status

Update Container Drivers

# Copy the .run file to each GPU container
for CTID in 100 102; do
  pct push $CTID /root/NVIDIA-Linux-x86_64-NEW_VERSION.run /root/NVIDIA-Linux-x86_64-NEW_VERSION.run
  pct exec $CTID -- chmod +x /root/NVIDIA-Linux-x86_64-NEW_VERSION.run
  pct exec $CTID -- /root/NVIDIA-Linux-x86_64-NEW_VERSION.run --no-kernel-module --no-opengl-files --no-questions
done

Verify Version Match

# Compare versions
echo "=== Host ===" && nvidia-smi --query-gpu=driver_version --format=csv,noheader
echo "=== CT 100 (Ollama) ===" && pct exec 100 -- nvidia-smi --query-gpu=driver_version --format=csv,noheader
echo "=== CT 102 (llama-cpp) ===" && pct exec 102 -- nvidia-smi --query-gpu=driver_version --format=csv,noheader
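
Comparing the three lines by eye works, but the same comparison can fail loudly on a mismatch. A minimal sketch, assuming a single GPU so that each query returns one line; `versions_match` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: return non-zero if any container driver version differs from the host.
# Arguments: <host-version> <container-version>...

versions_match() {
  local host v rc
  host=$1
  shift
  rc=0
  for v in "$@"; do
    if [ "$v" != "$host" ]; then
      echo "MISMATCH: host=$host container=${v:-<none>}" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage on the host:
#   q="--query-gpu=driver_version --format=csv,noheader"
#   versions_match "$(nvidia-smi $q)" \
#     "$(pct exec 100 -- nvidia-smi $q)" \
#     "$(pct exec 102 -- nvidia-smi $q)" && echo "Driver versions match"
```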

Restart GPU Services

# Restart GPU services
pct exec 100 -- systemctl start ollama
pct exec 102 -- systemctl start llama-server
 
# Verify services and GPU access
pct exec 100 -- systemctl status ollama --no-pager
pct exec 102 -- systemctl status llama-server --no-pager
 
# Quick inference test
pct exec 100 -- ollama list
curl -s http://192.168.50.45:8012/health

Post-Update Verification

# === POST-UPDATE VERIFICATION ===
echo "=== 1. Proxmox VE Version ==="
pveversion -v | head -5
 
echo -e "\n=== 2. ZFS Health ==="
zpool status -x
 
echo -e "\n=== 3. Container Status ==="
pct list
 
echo -e "\n=== 4. GPU Status ==="
nvidia-smi --query-gpu=name,driver_version,temperature.gpu --format=csv,noheader
 
echo -e "\n=== 5. NVIDIA Driver Match ==="
echo "Host: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader)"
echo "CT100: $(pct exec 100 -- nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null || echo 'N/A')"
echo "CT102: $(pct exec 102 -- nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null || echo 'N/A')"
 
echo -e "\n=== 6. Service Health ==="
echo "Ollama:     $(pct exec 100 -- systemctl is-active ollama 2>/dev/null || echo 'unknown')"
echo "llama-srv:  $(pct exec 102 -- systemctl is-active llama-server 2>/dev/null || echo 'unknown')"
echo "Open WebUI: $(pct exec 103 -- systemctl is-active open-webui 2>/dev/null || echo 'unknown')"
echo "Pi-hole:    $(pct exec 105 -- systemctl is-active pihole-FTL 2>/dev/null || echo 'unknown')"
echo "Nginx:      $(pct exec 200 -- systemctl is-active nginx 2>/dev/null || echo 'unknown')"
echo "Cloudflared: $(pct exec 201 -- systemctl is-active cloudflared 2>/dev/null || echo 'unknown')"
 
echo -e "\n=== 7. Endpoint Tests ==="
echo "Ollama API:    $(curl -s -o /dev/null -w '%{http_code}' http://192.168.50.40:58008/api/tags 2>/dev/null || echo 'FAIL')"
echo "llama.cpp:     $(curl -s -o /dev/null -w '%{http_code}' http://192.168.50.45:8012/health 2>/dev/null || echo 'FAIL')"
echo "Open WebUI:    $(curl -s -o /dev/null -w '%{http_code}' http://192.168.50.30:42 2>/dev/null || echo 'FAIL')"
echo "Pi-hole admin: $(curl -s -o /dev/null -w '%{http_code}' http://192.168.50.11/admin/ 2>/dev/null || echo 'FAIL')"
echo "Nginx HTTPS:   $(curl -sk -o /dev/null -w '%{http_code}' https://192.168.50.60 2>/dev/null || echo 'FAIL')"
 
echo -e "\n=== 8. DNS Resolution ==="
dig @192.168.50.11 google.com +short | head -1
 
echo -e "\n=== POST-UPDATE VERIFICATION COMPLETE ==="
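
The endpoint tests print raw HTTP status codes, which still need interpreting. The sketch below maps a code to a pass or fail label, treating 2xx and 3xx as healthy (in line with the "200 or 301" expectation for the Pi-hole admin page) and `000`, which curl reports when it gets no HTTP response, as a failure. `classify_http_code` is a hypothetical helper and the 5-second timeout in the usage line is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: turn an HTTP status code into an OK/FAIL label.

classify_http_code() {
  case $1 in
    2??|3??) echo "OK ($1)" ;;
    000|"")  echo "FAIL (no response)" ;;
    *)       echo "FAIL ($1)" ;;
  esac
}

# Usage on the host:
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://192.168.50.30:42)
#   echo "Open WebUI: $(classify_http_code "$code")"
```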

Clean Up Snapshots Later

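# Use the date the snapshots were taken; when cleaning up days later, today's date will not match
DATE=2025-01-19  # example value; check the real name with: pct listsnapshot 100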
for CTID in 100 102 103 105 200 201; do
  pct delsnapshot $CTID pre-update-$DATE 2>/dev/null && echo "Deleted snapshot for CT $CTID" || echo "No snapshot found for CT $CTID"
done

Keep those snapshots for at least a day or two if the services matter.
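
The retention advice can be checked mechanically from the snapshot name itself. This sketch assumes the `pre-update-YYYY-MM-DD` naming used on this page and GNU `date`; `snapshot_old_enough` is a hypothetical helper, and the two-day threshold in the usage line is only an example.

```shell
#!/usr/bin/env bash
# Sketch: succeed if a pre-update-YYYY-MM-DD snapshot is at least N days old.
# Arguments: <snapshot-name> <min-days> [today as YYYY-MM-DD]

snapshot_old_enough() {
  local name min_days today snap_date snap_s today_s
  name=$1
  min_days=$2
  today=${3:-$(date +%Y-%m-%d)}
  snap_date=${name#pre-update-}
  # Parse both dates as UTC midnight so the difference is a whole number of days.
  snap_s=$(date -u -d "$snap_date" +%s) || return 2
  today_s=$(date -u -d "$today" +%s) || return 2
  [ $(( (today_s - snap_s) / 86400 )) -ge "$min_days" ]
}

# Usage on the host:
#   snapshot_old_enough pre-update-2025-01-19 2 && pct delsnapshot 100 pre-update-2025-01-19
```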

Common Failure Modes

Proxmox Wants To Remove `proxmox-ve`

Stop immediately and inspect the repositories.

# Verify repos — see PROXMOX-3-OPERATIONS.md for correct config
cat /etc/apt/sources.list.d/pve-no-subscription.list
cat /etc/apt/sources.list.d/debian.sources
 
# Should show "trixie" in both files
# Re-run: apt update && apt full-upgrade -s

GPU Not Available Inside Containers After Reboot

# Verify persistence daemon is running
systemctl status nvidia-persistenced
 
# If not running:
systemctl enable --now nvidia-persistenced
 
# Check device nodes exist
ls -la /dev/nvidia*
 
# Restart the container (sometimes device passthrough needs a fresh start)
pct reboot 100
pct reboot 102

Pi-hole Update Fails

# Check DNS resolution inside the container
pct exec 105 -- dig github.com +short
 
# If DNS fails, the container may be using itself as DNS (circular dependency)
pct exec 105 -- cat /etc/resolv.conf
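# Temporarily set external DNS (keep a copy of the original file first):
pct exec 105 -- cp /etc/resolv.conf /etc/resolv.conf.pre-update
pct exec 105 -- bash -c "echo 'nameserver 1.1.1.1' > /etc/resolv.conf"
pct exec 105 -- pihole -up
# Restore Pi-hole DNS after update completes
pct exec 105 -- cp /etc/resolv.conf.pre-update /etc/resolv.conf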

Roll Back A Container Snapshot

# List available snapshots
pct listsnapshot <CTID>
 
# Roll back to pre-update snapshot
pct rollback <CTID> pre-update-2025-01-19
 
# Start the container
pct start <CTID>

Email Notifications — make maintenance visible instead of silent.
PVE 9.2 Upgrade Runbook — the major upgrade path that extends this regular maintenance rhythm.
Kernel 7.0 Boot Hang RCA — the post-upgrade side effect that required physical console recovery.
Proxmox Workloads — the service and platform guides that sit behind many of these update routines.
Container Network Throttling — a useful temporary control when a single LXC is pulling large artifacts during a maintenance window.
