Node Monitoring

Set up monitoring for your blockchain infrastructure with Prometheus, Grafana, and Alertmanager so you can catch sync, resource, and network problems before they affect uptime and performance.

The Monitoring Stack

  • Prometheus (metrics collection): Time-series database for collecting and storing metrics from your nodes
  • Grafana (visualization): Dashboard platform for creating visualizations of your metrics
  • Alertmanager (alerting): Handles alerts from Prometheus and routes them to various notification channels
  • Node Exporter (system metrics): Collects hardware and OS metrics like CPU, memory, and disk usage

Key Metrics to Monitor

Node Health

  • Sync Status (critical): Is the node fully synced?
  • Peer Count (critical): Number of connected peers
  • Block Height (critical): Current block vs network head
  • Chain Reorgs: Number of chain reorganizations

System Resources

  • CPU Usage (critical): Processor utilization percentage
  • Memory Usage (critical): RAM consumption
  • Disk I/O (critical): Read/write operations per second
  • Disk Space (critical): Available storage remaining

Network

  • Bandwidth In/Out: Network traffic volume
  • P2P Latency: Peer connection latency
  • RPC Response Time (critical): API response latency
  • Connection Errors (critical): Failed connection attempts

Quick Setup with Docker

docker-compose.monitoring.yml
# Start the stack with: docker compose -f docker-compose.monitoring.yml up -d
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
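    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Mount the rule file that prometheus.yml lists under rule_files
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus-data:/prometheus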
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      # Default password for the admin user; change it before exposing Grafana beyond localhost
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

volumes:
  prometheus-data:
  grafana-data:
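
alertmanager.yml

The compose file mounts ./alertmanager.yml, but its contents are not shown. A minimal sketch follows, assuming alerts carry a severity label of "critical" or "warning". The receiver names, Slack channel, webhook URL, and PagerDuty key are placeholders, not values from this guide.

```yaml
route:
  # Anything not matched by a child route goes to Slack
  receiver: slack-warnings
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts go to the on-call receiver instead
    - matchers:
        - severity="critical"
      receiver: oncall-critical

receivers:
  - name: slack-warnings
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'  # placeholder
        channel: '#node-alerts'                                           # placeholder
        send_resolved: true

  # One receiver may fan out to several channels at once
  - name: oncall-critical
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_PAGERDUTY_INTEGRATION_KEY'             # placeholder
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'  # placeholder
        channel: '#node-alerts'
        send_resolved: true
```

Sending critical alerts to both PagerDuty and Slack means a single failed channel does not silence a page.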

Prometheus Configuration

prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "alert_rules.yml"

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (system metrics)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

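  # Geth metrics (assumes a "geth" service on the same Docker network,
  # started with --metrics --metrics.addr 0.0.0.0 so port 6060 is reachable)
  - job_name: 'geth'
    static_configs:
      - targets: ['geth:6060']
    metrics_path: /debug/metrics/prometheus

  # Lighthouse beacon node metrics (assumes a "lighthouse" service started
  # with --metrics --metrics-address 0.0.0.0; served on the default /metrics path)
  - job_name: 'lighthouse'
    static_configs:
      - targets: ['lighthouse:5054']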

Essential Alert Rules

The conditions below are shorthand, not literal PromQL. The actual metric names depend on your client and exporter.

Node Out of Sync (critical)
  Condition: sync_status == false for 5m
  Action: Check node logs, restart if needed

Low Peer Count (warning)
  Condition: peer_count < 10 for 10m
  Action: Check network connectivity, firewall rules

High CPU Usage (warning)
  Condition: cpu_usage > 90% for 15m
  Action: Investigate processes, consider scaling

Disk Space Critical (critical)
  Condition: disk_free < 50GB
  Action: Prune data or expand storage immediately

Memory Pressure (warning)
  Condition: memory_usage > 85% for 10m
  Action: Check for memory leaks, adjust limits
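
alert_rules.yml

The five rules above could be written as a Prometheus rule file along these lines. This is a sketch under stated assumptions, not a definitive rule set. The system expressions use standard Node Exporter metrics. The Geth metric names (chain_head_block, p2p_peers) are assumed and should be checked against what your node actually exposes. The root mountpoint and the reading of "50GB" as 50e9 bytes are also assumptions.

```yaml
groups:
  - name: node-alerts
    rules:
      # Stand-in for "sync_status == false": the chain head has stopped advancing.
      # chain_head_block is an assumed Geth gauge name; verify it on your node.
      - alert: NodeOutOfSync
        expr: changes(chain_head_block{job="geth"}[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is not importing new blocks"

      # p2p_peers is an assumed Geth gauge name for connected peers.
      - alert: LowPeerCount
        expr: p2p_peers{job="geth"} < 10
        for: 10m
        labels:
          severity: warning

      # CPU busy percentage = 100 minus the idle share, averaged across cores.
      - alert: HighCpuUsage
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 15m
        labels:
          severity: warning

      # Assumes chain data lives on the root filesystem; adjust mountpoint if not.
      - alert: DiskSpaceCritical
        expr: node_filesystem_avail_bytes{mountpoint="/"} < 50e9
        labels:
          severity: critical

      # Memory in use as a percentage of total, based on MemAvailable.
      - alert: MemoryPressure
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 85
        for: 10m
        labels:
          severity: warning
```

Running promtool check rules alert_rules.yml before reloading Prometheus catches syntax errors in the file.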

Recommended Grafana Dashboards

Node Overview

High-level view of sync status, peer count, and block height across all nodes.

Dashboard ID: 13473

System Metrics

Detailed CPU, memory, disk, and network metrics from Node Exporter.

Dashboard ID: 1860

Geth Metrics

Geth-specific metrics including chain data, transaction pool, and RPC stats.

Dashboard ID: 13856
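
Imported dashboards need a Prometheus data source to query. One way to provide it is Grafana's file-based provisioning. The sketch below assumes the file is mounted into the grafana container under /etc/grafana/provisioning/datasources/ and that Prometheus is reachable at the compose service name. The file name prometheus-datasource.yml is hypothetical.

```yaml
# prometheus-datasource.yml (hypothetical file name)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the queries
    url: http://prometheus:9090    # compose service name and port from the stack above
    isDefault: true
```

With the data source provisioned, each dashboard can be imported by its ID and pointed at "Prometheus" when Grafana asks for a data source.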

Monitoring Best Practices

Do

  • Set up alerts for all critical metrics
  • Use multiple notification channels (Slack, PagerDuty, email)
  • Retain metrics for at least 30 days for trend analysis
  • Create runbooks for each alert type
  • Test alert routing regularly
  • Monitor the monitoring system itself

Don't

  • Create too many alerts (alert fatigue)
  • Ignore warning-level alerts
  • Skip documentation for dashboards
  • Rely on a single notification channel
  • Forget to monitor disk growth rate
  • Set thresholds too tight (false positives)
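
Two of the points above, monitoring the monitoring system and watching disk growth rate, can be expressed as alert rules. The group below is a sketch that could be appended to alert_rules.yml. It uses the built-in up metric and PromQL's predict_linear. The root mountpoint, the 6h lookback, and the four-day horizon are assumptions to tune for your setup.

```yaml
groups:
  - name: meta-monitoring
    rules:
      # A scrape target (including Prometheus itself or an exporter) is unreachable.
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"

      # Always-firing heartbeat: if this stops arriving at its receiver,
      # the alerting pipeline itself is broken.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none

      # Extrapolate the last 6h of free-space trend; fire if it reaches zero
      # within four days (4 * 24 * 3600 seconds).
      - alert: DiskFullInFourDays
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 24 * 3600) < 0
        for: 1h
        labels:
          severity: warning
```

A growth-rate alert gives days of notice, whereas a fixed free-space threshold only fires once the disk is nearly full.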

Want Pre-Built Monitoring?

ChainLens provides built-in monitoring and alerting for all your blockchain infrastructure.