Home Lab Monitoring with Grafana and Prometheus
You can't fix what you can't see. Running a homelab without monitoring is like driving without a dashboard — everything seems fine until something breaks, and by then you've been running on fumes for a week.
Prometheus and Grafana are the standard monitoring stack for good reason. Prometheus scrapes metrics from your machines and services. Grafana turns those metrics into dashboards and alerts. Together, they give you visibility into your entire lab: CPU, RAM, disk, network, temperatures, container health, and anything else you care to track.
This guide walks you through setting up the full stack, getting metrics flowing from your machines, building useful dashboards, and configuring alerts so you know when something needs attention.
Architecture Overview
The monitoring stack has three components:
- Prometheus — The time-series database. It scrapes metrics from exporters at regular intervals and stores them.
- Node Exporter — Runs on every machine you want to monitor. Exposes hardware and OS metrics as an HTTP endpoint.
- Grafana — The visualization layer. Connects to Prometheus and lets you build dashboards, run queries, and set up alerts.
You can run all three on the same machine, or split them up. For a homelab, running Prometheus and Grafana on a single Docker host is fine. Node Exporter runs on every target machine.
Installing the Stack with Docker Compose
Create a directory for your monitoring stack:
mkdir -p /opt/monitoring
cd /opt/monitoring
Create docker-compose.yml:
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=90d'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false

volumes:
  prometheus-data:
  grafana-data:
Create prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - '192.168.1.10:9100'   # proxmox
          - '192.168.1.50:9100'   # nas
          - '192.168.1.53:9100'   # pihole
          - '192.168.1.60:9100'   # monitoring-pi
          - '192.168.1.61:9100'   # vpn-pi
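Before starting anything, you can sanity-check the config with promtool, which ships inside the Prometheus image. A minimal check, assuming you're running it from /opt/monitoring:
docker run --rm \
  -v "$(pwd)/prometheus.yml:/prometheus.yml:ro" \
  --entrypoint /bin/promtool \
  prom/prometheus:latest check config /prometheus.yml
If the file has an indentation or syntax problem, promtool points at the offending line instead of Prometheus silently failing to start.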
Start the stack:
docker compose up -d
Grafana is now at http://your-server:3000 (login: admin / changeme). Prometheus is at http://your-server:9090.
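A quick way to confirm both services came up, and to reload Prometheus after future config edits (the /-/reload endpoint works because the compose file passes --web.enable-lifecycle):
# Prometheus health check
curl http://your-server:9090/-/healthy

# Grafana health check
curl http://your-server:3000/api/health

# Reload Prometheus after editing prometheus.yml (no restart needed)
curl -X POST http://your-server:9090/-/reload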
Installing Node Exporter on Target Machines
Node Exporter needs to run on every machine you want to monitor. There are several ways to install it.
Direct Install (Recommended for Bare Metal and VMs)
# Download the latest release
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
# Install the binary
sudo cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/node_exporter
# Create a system user
sudo useradd --no-create-home --shell /bin/false node_exporter
Create a systemd service at /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
User=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
Start it:
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
Verify it's working:
curl http://localhost:9100/metrics | head -20
Docker Install (For Docker Hosts)
If the target machine runs Docker:
docker run -d --restart=unless-stopped \
--name node-exporter \
--net=host \
--pid=host \
-v /:/host:ro,rslave \
prom/node-exporter:latest \
--path.rootfs=/host
The --net=host and --pid=host flags let the exporter report the host's network interfaces and processes rather than the container's. The read-only root filesystem mount lets it see the host's disk usage.
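Whichever install method you use, confirm Prometheus is actually scraping the new exporter. The Status > Targets page in the Prometheus UI shows every endpoint and its last scrape result, or you can query the up metric from the command line (adjust the hostname to your monitoring server):
# 1 = target is up, 0 = scrape is failing
curl -s 'http://your-server:9090/api/v1/query?query=up' | python3 -m json.tool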
Connecting Grafana to Prometheus
- Open Grafana (http://your-server:3000)
- Go to Connections > Data Sources > Add data source
- Select Prometheus
- Set the URL to http://prometheus:9090 (if on the same Docker network) or http://your-server-ip:9090
- Click Save & Test
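If you'd rather not click through the UI, or you want the data source to survive a container rebuild, Grafana can provision it from a file instead. A sketch, assuming you create a ./grafana-provisioning directory next to the compose file and mount it into the container at /etc/grafana/provisioning/datasources:
# ./grafana-provisioning/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
Add the matching volume line to the grafana service in docker-compose.yml (- ./grafana-provisioning:/etc/grafana/provisioning/datasources) and the data source appears on the next container start.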
Building Dashboards
Import a Pre-Built Dashboard
Don't start from scratch. The community has excellent pre-built dashboards. Go to Dashboards > Import and enter a dashboard ID from grafana.com/grafana/dashboards:
- Node Exporter Full (ID: 1860) — The most comprehensive node exporter dashboard. CPU, RAM, disk, network, filesystem, and more. This is the one most people use.
- Node Exporter for Prometheus (ID: 11074) — A cleaner, more focused alternative.
Enter the ID, select your Prometheus data source, and click Import. You'll instantly have a full dashboard for all your monitored hosts.
Build a Custom Overview Dashboard
The imported dashboards show detailed per-host metrics. For a homelab overview, build a custom dashboard with panels showing:
CPU Usage Across All Hosts:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory Usage Per Host:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Disk Usage Per Host:
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
Network Traffic:
rate(node_network_receive_bytes_total{device!="lo"}[5m]) * 8
System Uptime:
node_time_seconds - node_boot_time_seconds
For each panel, set meaningful thresholds — green below 70%, yellow at 70-85%, red above 85%. Use the Stat panel type for single values and Time series for historical data.
What to Monitor
Not all metrics are equally useful. Here's what actually matters in a homelab:
Critical (Set Alerts for These)
- Disk usage — The number one homelab killer. When a root partition fills up, things break in ugly ways. Alert at 85%.
- RAM usage — Especially on ZFS hosts where ARC cache makes free memory reporting confusing. Alert at 90%.
- Drive health — If you export SMART data (see below), alert on reallocated sectors or pending sectors.
- Service availability — Is Pi-hole responding? Is your NAS reachable?
Important (Check Weekly)
- CPU usage patterns — Sustained high CPU usually means a runaway process or misconfigured service.
- Network throughput — Unusual spikes can indicate a misconfigured backup, a download gone wrong, or worse.
- Temperatures — CPU and disk temperatures. Especially important in enclosed spaces or during summer.
- Swap usage — Sustained swap activity usually means you're running out of RAM (see the example queries at the end of this section).
Nice to Have
- System load averages — The 1/5/15 minute load gives you a feel for overall system health.
- Disk I/O — Useful for identifying bottlenecks, especially on shared NAS storage.
- Network errors — Non-zero error counts indicate bad cables or network interface issues.
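Two example queries for the swap and temperature items above. These are sketches: the metric names assume node_exporter's default meminfo and hwmon collectors, and temperature data only appears if your hardware exposes sensors to the kernel.
# Swap currently in use, in bytes
node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes

# CPU / motherboard sensor temperatures, in Celsius
node_hwmon_temp_celsius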
Setting Up Alerts
Grafana can send alerts through email, Discord, Slack, Telegram, and many other channels. Go to Alerting > Contact Points to configure where alerts go.
Example: Disk Space Alert
- Go to Alerting > Alert Rules > New Alert Rule
- Query:
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
- Set condition: Is above 85
- Evaluation interval: every 5 minutes
- Choose your contact point
Example: Host Down Alert
up{job="node"} == 0
This fires when Prometheus can't reach a node exporter. Set the pending period to 2-3 minutes to avoid false alarms from brief network blips.
Discord Webhook (Popular for Homelabs)
- In Discord, go to your server's channel settings > Integrations > Webhooks
- Create a webhook and copy the URL
- In Grafana, create a contact point with type Discord, paste the webhook URL
- Test it
Now you'll get Discord notifications when your lab needs attention.
Additional Exporters
Node Exporter covers system metrics, but you can monitor much more:
cAdvisor (Container Metrics)
docker run -d --restart=unless-stopped \
--name cadvisor \
-p 8080:8080 \
-v /:/rootfs:ro \
-v /var/run:/var/run:ro \
-v /sys:/sys:ro \
-v /var/lib/docker/:/var/lib/docker:ro \
gcr.io/cadvisor/cadvisor:latest
Add to prometheus.yml:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['192.168.1.10:8080']
SNMP Exporter (Network Gear)
Monitor your router, switches, and access points if they support SNMP:
  - job_name: 'snmp'
    static_configs:
      - targets:
          - 192.168.1.1   # router
    metrics_path: /snmp
    params:
      module: [if_mib]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: snmp-exporter:9116
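This scrape config assumes an snmp_exporter instance is reachable at snmp-exporter:9116, which is not part of the stack above. One way to run it, as a sketch, is to add it as another service in the same docker-compose.yml so the hostname resolves on the compose network (the bundled snmp.yml ships an if_mib module; anything device-specific needs a generated config):
  snmp-exporter:
    image: prom/snmp-exporter:latest
    container_name: snmp-exporter
    restart: unless-stopped
    ports:
      - "9116:9116"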
Blackbox Exporter (Endpoint Monitoring)
Check if your services are actually responding:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://192.168.1.53/admin   # Pi-hole
          - http://192.168.1.80         # Nginx
          - http://192.168.1.50         # NAS
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
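As with the SNMP job, this assumes a blackbox exporter reachable at blackbox-exporter:9115. A sketch of running it as another service in the same compose file (the default configuration already includes an http_2xx module):
  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"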
Storage and Retention
Prometheus stores data on disk. For a homelab with 5-10 hosts scraped every 15 seconds, expect:
- ~2-3 GB per month of storage
- 200-500 MB RAM for Prometheus
Set retention with the --storage.tsdb.retention.time flag. 90 days is a good default for homelabs — enough to spot trends without eating too much disk. If you need longer retention, look into Thanos or VictoriaMetrics as long-term storage backends.
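If you want to see how close those estimates are for your own lab, Prometheus exposes its own ingestion and storage numbers as standard self-metrics. Two queries worth graphing:
# Samples ingested per second across all targets
rate(prometheus_tsdb_head_samples_appended_total[5m])

# On-disk size of completed TSDB blocks, in bytes
prometheus_tsdb_storage_blocks_bytes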
Tips
Start with imported dashboards: Don't spend hours building dashboards before you understand what metrics matter to you. Use the community dashboards for a week, then customize.
Scrape interval of 15s is fine: Some people set it to 5s or even 1s. For a homelab, 15 seconds is plenty. Lower intervals increase storage and CPU usage without adding much value.
Label your instances: In prometheus.yml, you can add labels to make dashboard filtering easier:
  - targets: ['192.168.1.50:9100']
    labels:
      hostname: 'nas'
      location: 'basement'
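Those labels are attached to every metric scraped from that target, so you can filter or group by them in Grafana panels. For example, using the hostname label defined above:
# Memory usage for just the NAS
(1 - (node_memory_MemAvailable_bytes{hostname="nas"} / node_memory_MemTotal_bytes{hostname="nas"})) * 100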
Use Grafana playlists for a status screen: If you have a spare monitor or tablet, set up a playlist that cycles through your dashboards. It's satisfying and genuinely useful.
Monitoring might seem like overkill for a homelab, but it's one of those things where once you have it, you can't imagine operating without it. The first time Grafana alerts you that a disk is filling up before it causes a problem, the setup time pays for itself.