Node Monitoring

Set up monitoring for your Ethereum nodes with Prometheus, Grafana, and Alertmanager so that sync, peer, and resource problems are detected and routed to you around the clock.

Monitoring Stack

Prometheus

Time-series database that scrapes and stores metrics from your nodes.

prometheus.io

Grafana

Visualization platform for creating dashboards and exploring metrics.

grafana.com

Alertmanager

Handles alerts sent by Prometheus, routing them to Slack, PagerDuty, etc.

Documentation

Docker Compose Setup

docker-compose.monitoring.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
      - GF_INSTALL_PLUGINS=grafana-clock-panel
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager-data:/alertmanager

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'

volumes:
  prometheus-data:
  grafana-data:
  alertmanager-data:
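
The Grafana service mounts ./grafana/provisioning, but its contents are not shown here. A minimal sketch of a data source file follows, assuming the hypothetical path grafana/provisioning/datasources/prometheus.yml and the prometheus service name from the Compose file above:

```yaml
# grafana/provisioning/datasources/prometheus.yml (hypothetical path)
# Registers the Prometheus container as Grafana's default data source.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend queries Prometheus
    url: http://prometheus:9090   # service name and port from the Compose file
    isDefault: true
```

With a file like this in place, Grafana should list the data source on startup without manual setup in the UI.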

Prometheus Configuration

prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Geth execution client
  - job_name: 'geth'
    static_configs:
      - targets: ['geth:6060']
    metrics_path: /debug/metrics/prometheus

  # Prysm beacon node
  - job_name: 'prysm-beacon'
    static_configs:
      - targets: ['prysm:8080']

  # Prysm validator (if running)
  - job_name: 'prysm-validator'
    static_configs:
      - targets: ['prysm-validator:8081']

Enable Metrics on Nodes

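Geth: Add --metrics --metrics.addr 0.0.0.0 (metrics are served on port 6060 by default; change it with --metrics.port)
Prysm: Metrics are enabled by default on port 8080 (beacon node) and 8081 (validator). The endpoint binds to 127.0.0.1 by default, so add --monitoring-host 0.0.0.0 when Prometheus runs in a separate container.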
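
As a rough illustration, the flags above could be wired into Compose services like this. The image names and tags are assumptions, and every flag unrelated to metrics (data directories, JWT secret, execution endpoint, and so on) is omitted. The service names geth and prysm are chosen to match the scrape targets in prometheus.yml.

```yaml
# Sketch only: metrics-related flags for the execution and consensus clients.
# Image names/tags are assumptions; all other required flags are left out.
services:
  geth:
    image: ethereum/client-go:stable
    command:
      - '--metrics'
      - '--metrics.addr=0.0.0.0'    # listen on all interfaces inside the container
      - '--metrics.port=6060'       # matches the geth:6060 scrape target

  prysm:
    image: gcr.io/prysmaticlabs/prysm/beacon-chain:stable
    command:
      - '--monitoring-host=0.0.0.0' # make the metrics endpoint reachable by Prometheus
      - '--monitoring-port=8080'    # matches the prysm:8080 scrape target
```

No ports are published here, so the metrics endpoints stay reachable only from containers on the same Compose network, which is enough for Prometheus if it runs in the same project.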

Key Metrics to Monitor

Node Health

eth_syncing (Critical): Sync status; should be false when synced
eth_blockNumber (Critical): Current block number
eth_peerCount (Critical): Number of connected peers
chain_head_block: Latest block on the chain

Performance

process_cpu_seconds_total: CPU usage of the node process
process_resident_memory_bytes (Critical): Memory usage
p2p_peers (Critical): Connected P2P peers count
rpc_duration_seconds: RPC request latency

Consensus

beacon_head_slot (Critical): Current beacon chain slot
beacon_finalized_epoch (Critical): Last finalized epoch
validator_count: Total validators known
attestation_count: Attestations processed

Recommended Alert Rules

Node Out of Sync

critical

Node has been syncing for too long or fell behind

eth_syncing == 1 for > 10m

Low Peer Count

warning

Node has fewer peers than recommended

eth_peerCount < 10

High Memory Usage

warning

Node memory approaching system limits

process_resident_memory_bytes > 28GB

Block Production Stalled

critical

No new blocks received in 5 minutes

rate(eth_blockNumber[5m]) == 0

Beacon Chain Not Finalized

critical

Beacon chain has not finalized in 30+ minutes

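time() - (1606824023 + beacon_finalized_epoch * 384) > 1800
(1606824023 is the mainnet beacon chain genesis time in Unix seconds; 384 s is one epoch of 32 slots at 12 s each. Without the genesis offset the expression compares Unix time against seconds since genesis and is always true.)

rules/node-alerts.yml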
groups:
  - name: ethereum-node
    rules:
      - alert: NodeOutOfSync
        expr: eth_syncing == 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Ethereum node is syncing"
          description: "Node has been syncing for more than 10 minutes"

      - alert: LowPeerCount
        expr: eth_peerCount < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has {{ $value }} peers (< 10)"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes > 28e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Node memory usage is {{ $value | humanize }}"

      - alert: BeaconNotFinalized
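        # 1606824023 = mainnet beacon chain genesis (Unix seconds); 384 s = one epoch (32 slots x 12 s).
        # The offset converts the finalized epoch into a Unix timestamp before comparing with time().
        expr: time() - (1606824023 + beacon_finalized_epoch * 384) > 1800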
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Beacon chain not finalizing"
          description: "No finalization for 30+ minutes"
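
The rules file above defines four of the five recommended alerts; Block Production Stalled is described but has no rule. A sketch of that rule as a separate group follows, reusing the expression from its card. The file name rules/block-alerts.yml, the group name, and the 1m hold time are assumptions, as is the premise that eth_blockNumber is exported as a numeric series by your setup.

```yaml
# rules/block-alerts.yml (hypothetical file name; it would be picked up by the
# /etc/prometheus/rules/*.yml glob in prometheus.yml)
groups:
  - name: ethereum-node-blocks
    rules:
      - alert: BlockProductionStalled
        # Same expression as the "Block Production Stalled" card: the block
        # number has not increased over the last 5 minutes.
        expr: rate(eth_blockNumber[5m]) == 0
        for: 1m   # assumed hold time to avoid firing on a single evaluation
        labels:
          severity: critical
        annotations:
          summary: "Block production stalled"
          description: "No new blocks received in 5 minutes"
```

Running promtool check rules against the file before reloading Prometheus is a cheap way to catch syntax errors.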

Alertmanager Configuration

alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'critical'

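# Alertmanager does not expand ${VAR} placeholders in this file; substitute
# the real values (for example with envsubst) before starting the container.
receivers: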
  - name: 'default'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'critical'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#alerts-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_KEY}'
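
Because Alertmanager reads alertmanager.yml literally, one way to keep secrets out of the file is to point each receiver at a secret file instead. The sketch below assumes an Alertmanager release recent enough to support api_url_file and service_key_file, and the paths under /etc/alertmanager/secrets/ are hypothetical mounts you would add to the Compose service yourself.

```yaml
# alertmanager.yml, receivers section only (sketch).
# Assumes file-based secret options are available in your Alertmanager version;
# the secret paths are hypothetical and must be mounted into the container.
receivers:
  - name: 'default'
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack_webhook_url
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'critical'
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack_webhook_url
        channel: '#alerts-critical'
    pagerduty_configs:
      - service_key_file: /etc/alertmanager/secrets/pagerduty_key
```

Each secret file holds only the raw value (the webhook URL or the integration key), so the main configuration can be committed to version control without exposing credentials.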