Prometheus And Grafana Stack On Proxmox

This page owns the actual stack build.

If Monitoring And Alerts is the decision layer, this is the runbook: one monitoring guest, host-side exporters, Prometheus for collection, and Grafana for visibility.

Architecture Overview

Proxmox host (192.168.50.20)
  -> node_exporter :9100
  -> nvidia_gpu_exporter :9835
  -> smartmon textfile collector
 
CT 400 monitoring (192.168.50.80)
  -> Prometheus :9090
  -> Grafana :3000
  -> optional pve-exporter :9221

That is enough to cover host CPU and memory, GPU thermals and VRAM, disk health, and any additional targets you decide to scrape later.

Create The Monitoring LXC

On the Proxmox host:

pct create 400 local:vztmpl/debian-12-standard_12.12-1_amd64.tar.zst \
  --hostname monitoring \
  --cores 2 \
  --memory 2048 \
  --swap 512 \
  --rootfs local-zfs:30 \
  --net0 name=eth0,bridge=vmbr0,ip=192.168.50.80/24,gw=192.168.50.1 \
  --nameserver 192.168.50.10 \
  --searchdomain lan \
  --unprivileged 1 \
  --features nesting=1 \
  --onboot 1 \
  --start 0 \
  --password

Then start the container, enter it, and install the base packages inside it:

pct start 400
pct enter 400
 
# Update packages
apt update && apt upgrade -y
 
# Install essentials
apt install -y curl wget gnupg2 apt-transport-https software-properties-common lsb-release ca-certificates

Install `node_exporter` On The Host

# Check latest version at https://github.com/prometheus/node_exporter/releases
NODE_EXPORTER_VERSION="1.10.2"
 
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
 
tar xzf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
chmod +x /usr/local/bin/node_exporter

useradd --no-create-home --shell /usr/sbin/nologin node_exporter
mkdir -p /var/lib/node_exporter/textfile_collector
chown node_exporter:node_exporter /var/lib/node_exporter/textfile_collector

cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target
 
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
  --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($$|/)" \
  --collector.hwmon \
  --collector.cpu.info \
  --collector.meminfo \
  --collector.diskstats \
  --collector.netdev \
  --collector.thermal_zone \
  --collector.loadavg \
  --collector.pressure \
  --web.listen-address=:9100
 
Restart=always
RestartSec=5
 
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now node_exporter
 
# Test the metrics endpoint
curl -s http://localhost:9100/metrics | head -20

Install NVIDIA Metrics On The Host

# Check latest version at https://github.com/utkuozdemir/nvidia_gpu_exporter/releases
NVIDIA_EXPORTER_VERSION="1.4.1"
 
cd /tmp
wget https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v${NVIDIA_EXPORTER_VERSION}/nvidia_gpu_exporter_${NVIDIA_EXPORTER_VERSION}_linux_x86_64.tar.gz
 
tar xzf nvidia_gpu_exporter_${NVIDIA_EXPORTER_VERSION}_linux_x86_64.tar.gz
cp nvidia_gpu_exporter /usr/local/bin/
chmod +x /usr/local/bin/nvidia_gpu_exporter

cat > /etc/systemd/system/nvidia_gpu_exporter.service << 'EOF'
[Unit]
Description=NVIDIA GPU Exporter for Prometheus
Documentation=https://github.com/utkuozdemir/nvidia_gpu_exporter
Wants=network-online.target
After=network-online.target
 
[Service]
Type=simple
ExecStart=/usr/local/bin/nvidia_gpu_exporter --web.listen-address=:9835
Restart=always
RestartSec=5
 
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now nvidia_gpu_exporter
 
# Test - should show GPU metrics
curl -s http://localhost:9835/metrics | grep -E "nvidia_smi_(temperature|fan|memory|power|utilization)" | head -30

Install SMART Monitoring On The Host

apt install -y smartmontools

cat > /usr/local/bin/smartmon.sh << 'EOF'
#!/bin/bash
# SMART monitoring script for Prometheus node_exporter textfile collector
# Supports both traditional SATA/SAS and NVMe drives with temperature extraction
 
set -u
 
get_disks() {
    lsblk -d -n -o NAME,TYPE | awk '$2 == "disk" {print "/dev/" $1}'
}
 
parse_nvme_smart() {
    local disk=$1
    local output
 
    output=$(smartctl -a "$disk" 2>&1)
 
    # Extract temperature
    temp=$(echo "$output" | grep "^Temperature:" | head -1 | awk '{print $(NF-1)}')
    if [[ -n "$temp" ]] && [[ "$temp" =~ ^[0-9]+$ ]]; then
        echo "smartmon_temperature_celsius_value{disk=\"$disk\",type=\"nvme\"} $temp"
    fi
 
    # Extract percentage used
    percent_used=$(echo "$output" | grep "Percentage Used:" | awk '{print $NF}' | sed 's/%//')
    if [[ -n "$percent_used" ]] && [[ "$percent_used" =~ ^[0-9]+$ ]]; then
        echo "smartmon_percentage_used_raw_value{disk=\"$disk\",type=\"nvme\"} $percent_used"
    fi
 
    # Extract power on hours
    power_on_hours=$(echo "$output" | grep "Power On Hours:" | awk '{print $NF}' | tr -d ',')
    if [[ -n "$power_on_hours" ]] && [[ "$power_on_hours" =~ ^[0-9]+$ ]]; then
        echo "smartmon_power_on_hours_raw_value{disk=\"$disk\",type=\"nvme\"} $power_on_hours"
    fi
}
 
parse_sata_smart() {
    local disk=$1
    local output
 
    output=$(smartctl -a "$disk" 2>&1)
 
    # Extract temperature from SATA attributes
    temp=$(echo "$output" | grep "194 Temperature_Celsius" | awk '{print $10}')
    if [[ -n "$temp" ]] && [[ "$temp" =~ ^[0-9]+$ ]]; then
        echo "smartmon_temperature_celsius_value{disk=\"$disk\",type=\"sata\"} $temp"
    fi
 
    # Extract reallocated sectors
    reallocated=$(echo "$output" | grep "5 Reallocated_Sector" | awk '{print $10}')
    if [[ -n "$reallocated" ]] && [[ "$reallocated" =~ ^[0-9]+$ ]]; then
        echo "smartmon_reallocated_sector_ct_raw_value{disk=\"$disk\",type=\"sata\"} $reallocated"
    fi
 
    # Extract power on hours
    power=$(echo "$output" | grep "9 Power_On_Hours" | awk '{print $10}')
    if [[ -n "$power" ]] && [[ "$power" =~ ^[0-9]+$ ]]; then
        echo "smartmon_power_on_hours_raw_value{disk=\"$disk\",type=\"sata\"} $power"
    fi
}
 
# Get smartctl version - properly handle multiline output
smartctl_version=$(smartctl -V 2>&1 | head -1 | awk '{for(i=1;i<=NF;i++) if($i ~ /^[0-9]+\.[0-9]+/) print $i}')
if [[ -z "$smartctl_version" ]]; then
    smartctl_version="unknown"
fi
 
# Header
echo "# HELP smartmon_smartctl_version SMART metric smartctl_version"
echo "# TYPE smartmon_smartctl_version gauge"
echo "smartmon_smartctl_version{version=\"$smartctl_version\"} 1"
 
echo ""
echo "# HELP smartmon_device_active SMART metric device_active"
echo "# TYPE smartmon_device_active gauge"
 
echo ""
echo "# HELP smartmon_device_smart_healthy SMART metric device_smart_healthy"
echo "# TYPE smartmon_device_smart_healthy gauge"
 
echo ""
echo "# HELP smartmon_temperature_celsius_value SMART metric temperature_celsius_value"
echo "# TYPE smartmon_temperature_celsius_value gauge"
 
echo ""
echo "# HELP smartmon_reallocated_sector_ct_raw_value SMART metric reallocated_sector_ct_raw_value"
echo "# TYPE smartmon_reallocated_sector_ct_raw_value gauge"
 
echo ""
echo "# HELP smartmon_power_on_hours_raw_value SMART metric power_on_hours_raw_value"
echo "# TYPE smartmon_power_on_hours_raw_value gauge"
 
echo ""
echo "# HELP smartmon_percentage_used_raw_value SMART metric percentage_used_raw_value"
echo "# TYPE smartmon_percentage_used_raw_value gauge"
 
echo ""
echo "# HELP smartmon_smartctl_run SMART metric smartctl_run"
echo "# TYPE smartmon_smartctl_run gauge"
 
smartctl_run_timestamp=$(date +%s)
 
# Process each disk
for disk in $(get_disks); do
    if smartctl -i "$disk" &>/dev/null; then
        echo "smartmon_device_active{disk=\"$disk\"} 1"
 
        disk_info=$(smartctl -i "$disk" 2>&1)
 
        if echo "$disk_info" | grep -q "NVMe"; then
            # NVMe drive (the health verdict comes from -H, not from the -i output)
            if smartctl -H "$disk" 2>&1 | grep -q "PASSED"; then
                echo "smartmon_device_smart_healthy{disk=\"$disk\",type=\"nvme\"} 1"
            else
                echo "smartmon_device_smart_healthy{disk=\"$disk\",type=\"nvme\"} 0"
            fi
 
            # Parse NVMe SMART
            parse_nvme_smart "$disk"
        else
            # SATA/SAS drive
            if smartctl -H "$disk" 2>&1 | grep -q "PASSED"; then
                echo "smartmon_device_smart_healthy{disk=\"$disk\",type=\"sata\"} 1"
            else
                echo "smartmon_device_smart_healthy{disk=\"$disk\",type=\"sata\"} 0"
            fi
 
            # Parse SATA SMART
            parse_sata_smart "$disk"
        fi
 
        echo "smartmon_smartctl_run{disk=\"$disk\"} $smartctl_run_timestamp"
    fi
done
EOF
 
chmod +x /usr/local/bin/smartmon.sh
 
# Test it
/usr/local/bin/smartmon.sh | head -30
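The NVMe fields that smartctl prints do not share one layout, so the awk field that holds the value differs from line to line. The sketch below runs the three extractions against sample lines in the shape smartctl prints for NVMe devices. The values are made up, and the helper function names are illustrative only, not part of smartmon.sh.

```shell
#!/bin/bash
# Sample lines in the shape `smartctl -a` prints for NVMe devices (values are made up).
sample_nvme_output() {
    cat << 'SAMPLE'
Temperature:                        38 Celsius
Percentage Used:                    2%
Power On Hours:                     1,234
SAMPLE
}

# The value is followed by a unit word, so it is the second-to-last field.
nvme_temperature() { grep "^Temperature:" | head -1 | awk '{print $(NF-1)}'; }

# The value is the last field; strip the trailing percent sign.
nvme_percentage_used() { grep "Percentage Used:" | awk '{print $NF}' | tr -d '%'; }

# The value is the last field; strip the thousands separator.
nvme_power_on_hours() { grep "Power On Hours:" | awk '{print $NF}' | tr -d ','; }

sample_nvme_output | nvme_temperature      # prints 38
sample_nvme_output | nvme_percentage_used  # prints 2
sample_nvme_output | nvme_power_on_hours   # prints 1234
```

Running the same pipelines against real `smartctl -a /dev/nvme0n1` output is a quick way to confirm that a given drive reports these lines in the expected shape.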

cat > /etc/cron.d/smartmon-prometheus << 'EOF'
# Collect SMART metrics for Prometheus node_exporter textfile collector
# Runs every 5 minutes - output is ~2 KB per disk, negligible I/O
# CRITICAL: PATH must include /usr/sbin where smartctl is located
# CRITICAL: chmod ensures node_exporter user can read the output file
PATH=/usr/sbin:/usr/bin:/bin
*/5 * * * * root /usr/local/bin/smartmon.sh > /var/lib/node_exporter/textfile_collector/smartmon.prom 2>/dev/null && chmod 644 /var/lib/node_exporter/textfile_collector/smartmon.prom
EOF

# Force an immediate run
/usr/local/bin/smartmon.sh > /var/lib/node_exporter/textfile_collector/smartmon.prom && chmod 644 /var/lib/node_exporter/textfile_collector/smartmon.prom
 
# Verify node_exporter exposes the metrics
curl -s http://localhost:9100/metrics | grep "smartmon_temperature_celsius_value"
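node_exporter can scrape while the cron job is still writing, which would expose a half-written file. A common mitigation is to write to a temporary file in the same directory and rename it into place. A minimal sketch, assuming a hypothetical helper named write_prom_atomically. The textfile collector only reads files ending in .prom, so the temporary file is ignored until the rename.

```shell
#!/bin/bash
# Hypothetical helper: read metrics on stdin and publish them atomically.
# $1 = final .prom path inside the textfile collector directory.
write_prom_atomically() {
    local target=$1
    local tmp

    # Create the temp file next to the target so mv is a rename on one filesystem.
    tmp=$(mktemp "${target}.XXXXXX") || return 1

    if cat > "$tmp" && chmod 644 "$tmp" && mv "$tmp" "$target"; then
        return 0
    fi

    # Something failed: do not leave the temp file behind.
    rm -f "$tmp"
    return 1
}
```

With the function defined in a wrapper script, the cron command could become `/usr/local/bin/smartmon.sh | write_prom_atomically /var/lib/node_exporter/textfile_collector/smartmon.prom`.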

Install Prometheus In CT 400

useradd --no-create-home --shell /usr/sbin/nologin prometheus

PROMETHEUS_VERSION="3.2.1"
 
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
 
tar xzf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROMETHEUS_VERSION}.linux-amd64
 
# Install binaries
cp prometheus promtool /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
 
# Create config and data directories
mkdir -p /etc/prometheus /var/lib/prometheus
chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

cat > /etc/prometheus/prometheus.yml << 'EOF'
# Prometheus Configuration for Proxmox Homelab Monitoring
# Docs: https://prometheus.io/docs/prometheus/latest/configuration/
 
global:
  scrape_interval: 30s          # How often to scrape targets (30s is efficient for homelab)
  evaluation_interval: 30s      # How often to evaluate alerting rules
  scrape_timeout: 10s           # Timeout per scrape
 
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
        labels:
          instance: "monitoring"
 
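  - job_name: "proxmox-host"
    static_configs:
      - targets: ["192.168.50.20:9100"]
        labels:
          instance: "pve-host"
          environment: "homelab"

  - job_name: "nvidia-gpu"
    static_configs:
      - targets: ["192.168.50.20:9835"]
        labels:
          instance: "pve-host"
          environment: "homelab"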
 
  - job_name: "llama-cpp"
    scrape_interval: 30s
    static_configs:
      - targets: ["192.168.50.45:9414"]
        labels:
          instance: "llama-cpp"
          service: "llm"
EOF
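Before starting Prometheus, it can help to confirm from inside CT 400 that each target port accepts connections, since a host firewall would otherwise only show up later as a target that is down. A minimal sketch using bash's /dev/tcp redirection. port_open and check_targets are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a TCP connection to host:port opens within 3 seconds.
port_open() {
    local host=$1
    local port=$2
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print one line per host:port argument; return non-zero if any is unreachable.
check_targets() {
    local target failed=0
    for target in "$@"; do
        if port_open "${target%:*}" "${target##*:}"; then
            echo "open    ${target}"
        else
            echo "closed  ${target}"
            failed=1
        fi
    done
    return "$failed"
}
```

With the addresses used on this page, the call would be `check_targets 192.168.50.20:9100 192.168.50.20:9835 192.168.50.45:9414`. An open port only proves that something is listening, so the curl tests in the exporter sections remain the real check.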

cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/
Wants=network-online.target
After=network-online.target
 
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=10GB \
  --web.listen-address=:9090 \
  --web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5
 
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now prometheus
 
# Verify Prometheus reports healthy
curl -s http://localhost:9090/-/healthy
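Later edits to prometheus.yml do not need a restart. Because the unit above passes --web.enable-lifecycle, a reload can be triggered over HTTP, and promtool (installed alongside the prometheus binary) can reject a broken file first. A minimal sketch; validate_and_reload is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: check the config, and only reload Prometheus if it is valid.
# $1 = config path (defaults to the path used on this page).
validate_and_reload() {
    local config=${1:-/etc/prometheus/prometheus.yml}

    if ! promtool check config "$config"; then
        echo "config check failed, not reloading" >&2
        return 1
    fi

    # POST /-/reload is only served when --web.enable-lifecycle is set.
    curl -fsS -X POST http://localhost:9090/-/reload
}
```

Checking first matters because a reload with an invalid file is rejected and Prometheus keeps running on the old config, which is easy to miss.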

Optional `pve-exporter`

pve-exporter pulls node and guest metrics through the Proxmox API. It is optional: add it only if API access from CT 400 works reliably in your environment. Install it inside CT 400:

apt install -y python3-full python3-pip
python3 -m venv /opt/pve-exporter
/opt/pve-exporter/bin/pip install prometheus-pve-exporter

On the Proxmox host:

# Create a dedicated monitoring user with read-only access
pveum user add monitoring@pve --comment "Prometheus monitoring (read-only)"
pveum acl modify / --user monitoring@pve --role PVEAuditor
 
# Create an API token (no privilege separation for simplicity)
# The token value is printed only once; copy it into pve-exporter.yml
pveum user token add monitoring@pve prometheus --privsep 0

Inside CT 400:

cat > /etc/prometheus/pve-exporter.yml << 'EOF'
default:
  user: monitoring@pve
  token_name: prometheus
  token_value: "YOUR_TOKEN_VALUE_HERE"
  verify_ssl: false
EOF

cat > /etc/systemd/system/pve-exporter.service << 'EOF'
[Unit]
Description=Prometheus PVE Exporter
Documentation=https://github.com/prometheus-pve/prometheus-pve-exporter
Wants=network-online.target
After=network-online.target
 
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/pve-exporter/bin/pve_exporter \
  --config.file=/etc/prometheus/pve-exporter.yml \
  --web.listen-address=:9221
Restart=on-failure
RestartSec=5
 
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now pve-exporter
 
# Verify
curl -s "http://localhost:9221/pve?target=192.168.50.20&cluster=1&node=1" --max-time 15 | head -20
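The Prometheus config earlier on this page has no job for pve-exporter, so its metrics are not collected until one is added. The fragment below follows the multi-target relabeling pattern from the pve-exporter README, with the addresses from this page filled in. A sketch, assuming a hypothetical helper named append_pve_scrape_job and that scrape_configs is still the last top-level key in prometheus.yml, so the appended list item lands inside it.

```shell
#!/bin/bash
# Hypothetical helper: append a pve-exporter scrape job to a Prometheus config.
# $1 = path to prometheus.yml; scrape_configs must be the last top-level key.
append_pve_scrape_job() {
    local config=$1
    cat >> "$config" << 'EOF'

  - job_name: "pve"
    metrics_path: /pve
    params:
      module: [default]
      cluster: ["1"]
      node: ["1"]
    static_configs:
      - targets: ["192.168.50.20"]     # Proxmox host the exporter should query
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # becomes ?target=192.168.50.20
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9221    # scrape the exporter itself in CT 400
EOF
}
```

Run `append_pve_scrape_job /etc/prometheus/prometheus.yml`, then `promtool check config /etc/prometheus/prometheus.yml` and `systemctl reload prometheus`.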

Install Grafana In CT 400

# Import Grafana GPG key
mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor > /etc/apt/keyrings/grafana.gpg
 
# Add repository (OSS edition - free, open-source)
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  > /etc/apt/sources.list.d/grafana.list

apt update
apt install -y grafana

cat >> /etc/grafana/grafana.ini << 'EOF'
 
[server]
http_addr = 0.0.0.0
http_port = 3000
domain = 192.168.50.80
 
[security]
admin_user = admin
admin_password = admin
 
[users]
allow_sign_up = false
 
[analytics]
reporting_enabled = false
check_for_updates = false
 
[log]
mode = console file
level = warn
EOF

systemctl daemon-reload
systemctl enable --now grafana-server

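Then log in to Grafana at http://192.168.50.80:3000 (admin / admin; Grafana asks for a new password on first login) and add Prometheus as a data source with the URL http://localhost:9090. localhost works here because Grafana and Prometheus both run in CT 400.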
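The same data source can be provisioned from a file so that a rebuilt container comes up already wired. A sketch, assuming a hypothetical helper named write_prometheus_datasource; /etc/grafana/provisioning/datasources is the provisioning directory the Debian package uses by default.

```shell
#!/bin/bash
# Hypothetical helper: write a Grafana data source provisioning file.
# $1 = provisioning directory for data sources.
write_prometheus_datasource() {
    local dir=$1
    mkdir -p "$dir"
    cat > "${dir}/prometheus.yaml" << 'EOF'
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend makes the requests
    url: http://localhost:9090    # Prometheus runs in the same container
    isDefault: true
EOF
}
```

Run `write_prometheus_datasource /etc/grafana/provisioning/datasources`, make sure the grafana user can read the file, and restart with `systemctl restart grafana-server`.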

Validate The Targets

curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -E '"job"|"health"'

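All targets should report health: "up" within about a minute (two 30s scrape intervals). A target that stays "down" usually means the exporter is not running or the port is blocked.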
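For a quicker pass/fail view than reading the pretty-printed JSON, the health fields can be counted directly. A minimal sketch; summarize_target_health is a hypothetical helper, and it relies on Prometheus returning compact JSON in which each active target carries a "health":"..." field.

```shell
#!/bin/bash
# Hypothetical helper: read the /api/v1/targets JSON on stdin, print a summary,
# and return 0 only if at least one target exists and every target is up.
summarize_target_health() {
    local states total up

    # One line per target, e.g. "health":"up" or "health":"down".
    states=$(grep -o '"health":"[a-z]*"' || true)

    # grep -c prints 0 and exits non-zero when nothing matches, hence || true.
    total=$(printf '%s\n' "$states" | grep -c '"health":' || true)
    up=$(printf '%s\n' "$states" | grep -c '"health":"up"' || true)

    echo "up=${up} total=${total}"
    [ "$total" -gt 0 ] && [ "$up" -eq "$total" ]
}
```

Usage: `curl -s http://localhost:9090/api/v1/targets | summarize_target_health`. The exit status makes it usable in a script, for example to retry until every target is up.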

Monitoring And Alerts - the overview page for this subsection.
Dashboards And Alerting On Proxmox - the next step once data is flowing.
Secure Service Exposure On Proxmox - if Grafana or Prometheus ever need remote access, decide that there instead of improvising it here.
