Prometheus And Grafana Stack On Proxmox
Build a dedicated monitoring LXC, install host exporters, configure Prometheus scrape targets, and bring Grafana online for a Proxmox homelab.
Published January 28, 2025
This page owns the actual stack build.
If Monitoring And Alerts is the decision layer, this is the runbook: one monitoring guest, host-side exporters, Prometheus for collection, and Grafana for visibility.
Architecture Overview
Proxmox host (192.168.50.20)
-> node_exporter :9100
-> nvidia_gpu_exporter :9835
-> smartmon textfile collector
CT 400 monitoring (192.168.50.80)
-> Prometheus :9090
-> Grafana :3000
-> optional pve-exporter :9221
That is enough to cover host CPU and memory, GPU thermals and VRAM, disk health, and any additional targets you decide to scrape later.
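Once the stack is built, the layout above can be checked in one pass. The following is a minimal sketch, not part of the original runbook: `report_endpoints` is a hypothetical helper name, and the Prometheus and Grafana health paths are assumed to be the stock `/-/healthy` and `/api/health` endpoints. The probe command is passed in as arguments so the loop itself stays independent of curl.

```shell
#!/bin/sh
# Endpoints from the overview above: host exporters first, then CT 400 services.
ENDPOINTS="
192.168.50.20:9100/metrics
192.168.50.20:9835/metrics
192.168.50.80:9090/-/healthy
192.168.50.80:3000/api/health
"

# report_endpoints PROBE [ARGS...]
# Runs PROBE with each endpoint URL appended and prints OK or FAIL per endpoint.
# Hypothetical helper, not part of Prometheus, Grafana, or Proxmox.
report_endpoints() {
    for ep in $ENDPOINTS; do
        if "$@" "http://${ep}" > /dev/null 2>&1; then
            echo "OK   ${ep}"
        else
            echo "FAIL ${ep}"
        fi
    done
}

# Typical call from any machine on the LAN:
# report_endpoints curl -sf --max-time 5
```

A FAIL on :9835 is expected on hosts without an NVIDIA GPU, since that exporter is only installed where one exists.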
Create The Monitoring LXC
On the Proxmox host:
pct create 400 local:vztmpl/debian-12-standard_12.12-1_amd64.tar.zst \
--hostname monitoring \
--cores 2 \
--memory 2048 \
--swap 512 \
--rootfs local-zfs:30 \
--net0 name=eth0,bridge=vmbr0,ip=192.168.50.80/24,gw=192.168.50.1 \
--nameserver 192.168.50.10 \
--searchdomain lan \
--unprivileged 1 \
--features nesting=1 \
--onboot 1 \
--start 0 \
--password
Then:
pct start 400
pct enter 400
# Update packages
apt update && apt upgrade -y
# Install essentials
apt install -y curl wget gnupg2 apt-transport-https software-properties-common lsb-release ca-certificates
Install node_exporter On The Host
# Check latest version at https://github.com/prometheus/node_exporter/releases
NODE_EXPORTER_VERSION="1.10.2"
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
tar xzf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
chmod +x /usr/local/bin/node_exporter
useradd --no-create-home --shell /usr/sbin/nologin node_exporter
mkdir -p /var/lib/node_exporter/textfile_collector
chown node_exporter:node_exporter /var/lib/node_exporter/textfile_collector
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
--collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($$|/)" \
--collector.hwmon \
--collector.cpu.info \
--collector.meminfo \
--collector.diskstats \
--collector.netdev \
--collector.thermal_zone \
--collector.loadavg \
--collector.pressure \
--web.listen-address=:9100
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now node_exporter
# Test the metrics endpoint
curl -s http://localhost:9100/metrics | head -20
Install NVIDIA Metrics On The Host
# Check latest version at https://github.com/utkuozdemir/nvidia_gpu_exporter/releases
NVIDIA_EXPORTER_VERSION="1.4.1"
cd /tmp
wget https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v${NVIDIA_EXPORTER_VERSION}/nvidia_gpu_exporter_${NVIDIA_EXPORTER_VERSION}_linux_x86_64.tar.gz
tar xzf nvidia_gpu_exporter_${NVIDIA_EXPORTER_VERSION}_linux_x86_64.tar.gz
cp nvidia_gpu_exporter /usr/local/bin/
chmod +x /usr/local/bin/nvidia_gpu_exporter
cat > /etc/systemd/system/nvidia_gpu_exporter.service << 'EOF'
[Unit]
Description=NVIDIA GPU Exporter for Prometheus
Documentation=https://github.com/utkuozdemir/nvidia_gpu_exporter
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/nvidia_gpu_exporter --web.listen-address=:9835
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now nvidia_gpu_exporter
# Test - should show GPU metrics
# nvidia_gpu_exporter names its GPU metrics with the nvidia_smi_ prefix
curl -s http://localhost:9835/metrics | grep -E "nvidia_smi_(temperature|fan|memory|power|utilization)" | head -30
Install SMART Monitoring On The Host
apt install -y smartmontools
cat > /usr/local/bin/smartmon.sh << 'EOF'
#!/bin/bash
# SMART monitoring script for Prometheus node_exporter textfile collector
# Supports both traditional SATA/SAS and NVMe drives with temperature extraction
set -u
get_disks() {
lsblk -d -n -o NAME,TYPE | awk '$2 == "disk" {print "/dev/" $1}'
}
parse_nvme_smart() {
local disk=$1
local output
output=$(smartctl -a "$disk" 2>&1)
# Extract temperature
temp=$(echo "$output" | grep "^Temperature:" | head -1 | awk '{print $(NF-1)}')
if [[ -n "$temp" ]] && [[ "$temp" =~ ^[0-9]+$ ]]; then
echo "smartmon_temperature_celsius_value{disk=\"$disk\",type=\"nvme\"} $temp"
fi
# Extract percentage used
# The value is the last field of the line (e.g. "Percentage Used: 0%")
percent_used=$(echo "$output" | grep "Percentage Used:" | awk '{print $NF}' | sed 's/%//')
if [[ -n "$percent_used" ]] && [[ "$percent_used" =~ ^[0-9]+$ ]]; then
echo "smartmon_percentage_used_raw_value{disk=\"$disk\",type=\"nvme\"} $percent_used"
fi
# Extract power on hours
# The value is the last field of the line (e.g. "Power On Hours: 1,234")
power_on_hours=$(echo "$output" | grep "Power On Hours:" | awk '{print $NF}' | tr -d ',')
if [[ -n "$power_on_hours" ]] && [[ "$power_on_hours" =~ ^[0-9]+$ ]]; then
echo "smartmon_power_on_hours_raw_value{disk=\"$disk\",type=\"nvme\"} $power_on_hours"
fi
}
parse_sata_smart() {
local disk=$1
local output
output=$(smartctl -a "$disk" 2>&1)
# Extract temperature from SATA attributes
temp=$(echo "$output" | grep "194 Temperature_Celsius" | awk '{print $10}')
if [[ -n "$temp" ]] && [[ "$temp" =~ ^[0-9]+$ ]]; then
echo "smartmon_temperature_celsius_value{disk=\"$disk\",type=\"sata\"} $temp"
fi
# Extract reallocated sectors
reallocated=$(echo "$output" | grep "5 Reallocated_Sector" | awk '{print $10}')
if [[ -n "$reallocated" ]] && [[ "$reallocated" =~ ^[0-9]+$ ]]; then
echo "smartmon_reallocated_sector_ct_raw_value{disk=\"$disk\",type=\"sata\"} $reallocated"
fi
# Extract power on hours
power=$(echo "$output" | grep "9 Power_On_Hours" | awk '{print $10}')
if [[ -n "$power" ]] && [[ "$power" =~ ^[0-9]+$ ]]; then
echo "smartmon_power_on_hours_raw_value{disk=\"$disk\",type=\"sata\"} $power"
fi
}
# Get smartctl version - properly handle multiline output
smartctl_version=$(smartctl -V 2>&1 | head -1 | awk '{for(i=1;i<=NF;i++) if($i ~ /^[0-9]+\.[0-9]+/) print $i}')
if [[ -z "$smartctl_version" ]]; then
smartctl_version="unknown"
fi
# Header
echo "# HELP smartmon_smartctl_version SMART metric smartctl_version"
echo "# TYPE smartmon_smartctl_version gauge"
echo "smartmon_smartctl_version{version=\"$smartctl_version\"} 1"
echo ""
echo "# HELP smartmon_device_active SMART metric device_active"
echo "# TYPE smartmon_device_active gauge"
echo ""
echo "# HELP smartmon_device_smart_healthy SMART metric device_smart_healthy"
echo "# TYPE smartmon_device_smart_healthy gauge"
echo ""
echo "# HELP smartmon_temperature_celsius_value SMART metric temperature_celsius_value"
echo "# TYPE smartmon_temperature_celsius_value gauge"
echo ""
echo "# HELP smartmon_reallocated_sector_ct_raw_value SMART metric reallocated_sector_ct_raw_value"
echo "# TYPE smartmon_reallocated_sector_ct_raw_value gauge"
echo ""
echo "# HELP smartmon_power_on_hours_raw_value SMART metric power_on_hours_raw_value"
echo "# TYPE smartmon_power_on_hours_raw_value gauge"
echo ""
echo "# HELP smartmon_percentage_used_raw_value SMART metric percentage_used_raw_value"
echo "# TYPE smartmon_percentage_used_raw_value gauge"
echo ""
echo "# HELP smartmon_smartctl_run SMART metric smartctl_run"
echo "# TYPE smartmon_smartctl_run gauge"
smartctl_run_timestamp=$(date +%s)
# Process each disk
for disk in $(get_disks); do
if smartctl -i "$disk" &>/dev/null; then
echo "smartmon_device_active{disk=\"$disk\"} 1"
disk_info=$(smartctl -i "$disk" 2>&1)
if echo "$disk_info" | grep -q "NVMe"; then
# NVMe drive
# The health verdict comes from smartctl -H; the -i output never contains PASSED
if smartctl -H "$disk" 2>&1 | grep -q "PASSED"; then
echo "smartmon_device_smart_healthy{disk=\"$disk\",type=\"nvme\"} 1"
else
echo "smartmon_device_smart_healthy{disk=\"$disk\",type=\"nvme\"} 0"
fi
# Parse NVMe SMART
parse_nvme_smart "$disk"
else
# SATA/SAS drive
if smartctl -H "$disk" 2>&1 | grep -q "PASSED"; then
echo "smartmon_device_smart_healthy{disk=\"$disk\",type=\"sata\"} 1"
else
echo "smartmon_device_smart_healthy{disk=\"$disk\",type=\"sata\"} 0"
fi
# Parse SATA SMART
parse_sata_smart "$disk"
fi
echo "smartmon_smartctl_run{disk=\"$disk\"} $smartctl_run_timestamp"
fi
done
EOF
chmod +x /usr/local/bin/smartmon.sh
# Test it
/usr/local/bin/smartmon.sh | head -30
cat > /etc/cron.d/smartmon-prometheus << 'EOF'
# Collect SMART metrics for Prometheus node_exporter textfile collector
# Runs every 5 minutes - output is ~2 KB per disk, negligible I/O
# CRITICAL: PATH must include /usr/sbin where smartctl is located
# CRITICAL: chmod ensures node_exporter user can read the output file
PATH=/usr/sbin:/usr/bin:/bin
*/5 * * * * root /usr/local/bin/smartmon.sh > /var/lib/node_exporter/textfile_collector/smartmon.prom 2>/dev/null && chmod 644 /var/lib/node_exporter/textfile_collector/smartmon.prom
EOF
# Force an immediate run
/usr/local/bin/smartmon.sh > /var/lib/node_exporter/textfile_collector/smartmon.prom && chmod 644 /var/lib/node_exporter/textfile_collector/smartmon.prom
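The cron entry and the manual run both redirect straight into smartmon.prom, so a scrape that lands mid-write can read a partial file. A possible refinement, sketched here as an assumption rather than part of the original setup, is to write to a temporary file and rename it into place. `write_prom_atomically` is a hypothetical helper name. The temporary file does not end in .prom, so the textfile collector ignores it until the rename.

```shell
#!/bin/bash
# write_prom_atomically TARGET COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, captures its stdout in a temp file next to
# TARGET, and renames it into place only if COMMAND succeeded.
write_prom_atomically() {
    local target=$1
    shift
    local tmp
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if "$@" > "$tmp"; then
        chmod 644 "$tmp"      # mktemp creates the file 0600; node_exporter must read it
        mv "$tmp" "$target"   # a rename inside one directory is atomic
    else
        rm -f "$tmp"          # keep the previous good file on failure
        return 1
    fi
}

# Possible use in place of the redirect-and-chmod one-liner:
# write_prom_atomically /var/lib/node_exporter/textfile_collector/smartmon.prom /usr/local/bin/smartmon.sh
```

To use it from cron, the function would need to live in a small wrapper script that the cron entry calls.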
# Verify node_exporter exposes the metrics
curl -s http://localhost:9100/metrics | grep "smartmon_temperature_celsius_value"
Install Prometheus In CT 400
useradd --no-create-home --shell /usr/sbin/nologin prometheus
PROMETHEUS_VERSION="3.2.1"
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
tar xzf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROMETHEUS_VERSION}.linux-amd64
# Install binaries
cp prometheus promtool /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
# Create config and data directories
mkdir -p /etc/prometheus /var/lib/prometheus
chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
cat > /etc/prometheus/prometheus.yml << 'EOF'
# Prometheus Configuration for Proxmox Homelab Monitoring
# Docs: https://prometheus.io/docs/prometheus/latest/configuration/
global:
  scrape_interval: 30s       # How often to scrape targets (30s is efficient for homelab)
  evaluation_interval: 30s   # How often to evaluate alerting rules
  scrape_timeout: 10s        # Timeout per scrape
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
        labels:
          instance: "monitoring"
  - job_name: "proxmox-host"
    static_configs:
      - targets: ["192.168.50.20:9100"]
        labels:
          instance: "pve-host"
          environment: "homelab"
  # GPU metrics from nvidia_gpu_exporter on the Proxmox host
  - job_name: "nvidia-gpu"
    static_configs:
      - targets: ["192.168.50.20:9835"]
        labels:
          instance: "pve-host"
  # Example of an additional target; remove it if nothing listens at this address
  - job_name: "llama-cpp"
    scrape_interval: 30s
    static_configs:
      - targets: ["192.168.50.45:9414"]
        labels:
          instance: "llama-cpp"
          service: "llm"
EOF
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--storage.tsdb.retention.time=15d \
--storage.tsdb.retention.size=10GB \
--web.listen-address=:9090 \
--web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now prometheus
# Verify web UI is accessible
curl -s http://localhost:9090/-/healthy
Optional pve-exporter
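pve-exporter adds guest and cluster metrics from the Proxmox API. It is optional on purpose: use it only if the API path behaves cleanly in your environment. If you enable it, Prometheus collects its metrics only once a scrape job with metrics_path /pve points at localhost:9221. Inside CT 400: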
apt install -y python3-full python3-pip
python3 -m venv /opt/pve-exporter
/opt/pve-exporter/bin/pip install prometheus-pve-exporter
On the Proxmox host:
# Create a dedicated monitoring user with read-only access
pveum user add monitoring@pve --comment "Prometheus monitoring (read-only)"
pveum acl modify / --user monitoring@pve --role PVEAuditor
# Create an API token (no privilege separation for simplicity)
pveum user token add monitoring@pve prometheus --privsep 0
Inside CT 400:
cat > /etc/prometheus/pve-exporter.yml << 'EOF'
default:
  user: monitoring@pve
  token_name: prometheus
  token_value: "YOUR_TOKEN_VALUE_HERE"
  verify_ssl: false
EOF
cat > /etc/systemd/system/pve-exporter.service << 'EOF'
[Unit]
Description=Prometheus PVE Exporter
Documentation=https://github.com/prometheus-pve/prometheus-pve-exporter
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/pve-exporter/bin/pve_exporter \
--config.file=/etc/prometheus/pve-exporter.yml \
--web.listen-address=:9221
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now pve-exporter
# Verify
curl -s "http://localhost:9221/pve?target=192.168.50.20&cluster=1&node=1" --max-time 15 | head -20
Install Grafana In CT 400
# Import Grafana GPG key
mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor > /etc/apt/keyrings/grafana.gpg
# Add repository (OSS edition - free, open-source)
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
> /etc/apt/sources.list.d/grafana.list
apt update
apt install -y grafana
cat >> /etc/grafana/grafana.ini << 'EOF'
[server]
http_addr = 0.0.0.0
http_port = 3000
domain = 192.168.50.80
[security]
admin_user = admin
admin_password = admin
[users]
allow_sign_up = false
[analytics]
reporting_enabled = false
check_for_updates = false
[log]
mode = console file
level = warn
EOF
systemctl daemon-reload
systemctl enable --now grafana-server
Then add Prometheus as the data source at http://localhost:9090.
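The data source can also be declared in a file instead of clicked together in the UI, which makes the container easier to rebuild. This is a sketch under assumptions: `write_prometheus_datasource` is a hypothetical helper name, and the target directory is assumed to be the package default, /etc/grafana/provisioning/datasources.

```shell
#!/bin/sh
# write_prometheus_datasource DIR
# Hypothetical helper: writes a Grafana data source provisioning file into DIR.
# The URL matches the Prometheus instance running in the same container.
write_prometheus_datasource() {
    cat > "$1/prometheus.yaml" << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}

# Inside CT 400 (directory is the assumed package default):
# write_prometheus_datasource /etc/grafana/provisioning/datasources
# systemctl restart grafana-server
```

Grafana reads provisioning files at startup, so the restart is what makes the data source appear.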
Validate The Targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -E '"job"|"health"'
All targets should eventually report health: "up".
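For a check that can run unattended, the same API response can be reduced to a single number. A minimal sketch: `count_unhealthy_targets` is a hypothetical helper name, and it assumes the API returns compact JSON (`"health":"up"` with no spaces), so it should be fed the raw curl output rather than the pretty-printed form.

```shell
#!/bin/sh
# count_unhealthy_targets: reads /api/v1/targets JSON on stdin and prints how
# many targets report a health value other than "up" (for example "down" or
# "unknown"). Assumes compact JSON; pretty-printed input would not match.
count_unhealthy_targets() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -o '"health":"[a-z]*"' | grep -vc '"health":"up"' || true
}

# Inside CT 400:
# curl -s http://localhost:9090/api/v1/targets | count_unhealthy_targets
```

A result of 0 means every active target is up. Targets that have not been scraped yet report "unknown" and are counted until their first scrape completes.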
Related Topics
- Monitoring And Alerts - the overview page for this subsection.
- Dashboards And Alerting On Proxmox - the next step once data is flowing.
- Secure Service Exposure On Proxmox - if Grafana or Prometheus ever need remote access, decide that there instead of improvising it here.