How We Solved Replication Lag Detection: Beyond Basic Monitoring

PG Monitoring Team · April 12, 2026 · 8 min read

The Replication Lag Problem

PostgreSQL replication lag is a silent killer. By the time your monitoring shows lag above 30 seconds, reads from the standby are already stale and your business is already suffering. The native pg_stat_replication view only shows the current state. On its own, it can't tell you:

  • Whether lag is trending up or down
  • Why a specific stage (write, flush, or replay) has become the bottleneck
  • How long until replication completely fails
  • If a standby will recover or needs intervention

What Other Tools Give You

pg_stat_replication (Native PostgreSQL)

SELECT client_addr, state, replay_lag 
FROM pg_stat_replication;

This gives you a snapshot. You see that replay_lag is "00:00:15.234". Is that good? Bad? Getting worse? One row can't tell you. You have to run the query repeatedly and track the trend yourself.
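To make "track trends yourself" concrete, here is a minimal sketch of what that involves: poll the view, convert replay_lag to seconds, and fit a straight line through the samples. The function name and the sample values are hypothetical, and a least-squares slope is only one simple way to estimate a trend.

```python
def lag_slope(samples):
    """Least-squares slope of lag over time.

    samples: list of (timestamp_seconds, lag_seconds) pairs.
    Returns seconds of lag gained per second of wall-clock time.
    Positive means lag is growing; negative means the standby is catching up.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    num = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den


# Hypothetical samples: one poll every 10 s, replay_lag converted to seconds
samples = [(0, 15.2), (10, 15.9), (20, 16.8), (30, 17.4)]
slope = lag_slope(samples)  # roughly 0.075 s of extra lag per second
```

A 15-second reading with a positive slope is a very different situation from the same reading with a negative one, which is exactly what a single snapshot hides.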

pgWatch2 / Grafana Dashboards

These tools add visualization, but they're still reactive. They show you that lag has exceeded a threshold; they don't predict when it will. By the time the alert fires, your standby is already struggling.
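The gap between a threshold alert and a prediction can be illustrated with a short sketch. It assumes you already have a current lag reading and an estimated growth rate, and it uses plain linear extrapolation. The function name and numbers are hypothetical, and this is not a description of how any particular product computes its forecast.

```python
def seconds_until_threshold(current_lag, slope, threshold):
    """Linearly extrapolate when lag will reach the alert threshold.

    current_lag and threshold are in seconds; slope is seconds of lag
    gained per second. Returns 0.0 if the threshold is already reached,
    or None if lag is flat or shrinking (no crossing is predicted).
    """
    if current_lag >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current_lag) / slope


# Hypothetical reading: 17.4 s of lag, growing by 0.075 s per second
eta = seconds_until_threshold(17.4, 0.075, 30.0)  # about 168 seconds
```

A threshold alert on the same data would stay silent for nearly three more minutes.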

PG Monitoring's Approach: Predictive Risk Scoring

PG Monitoring uses a multi-factor risk model that combines:

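  • Write lag - How long the standby takes to receive and write WAL
  • Flush lag - How long until that WAL is flushed to disk on the standby
  • Replay lag - How long until the changes are actually applied on the standby
  • Lag trend - Is lag accelerating?
  • Slot bloat - Inactive replication slots retaining WAL on the primary
  • Conflict counts - Standby queries being cancelled by recovery conflicts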
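To show how six factors like these could fold into a single number, here is a minimal sketch of a weighted score. Everything in it is an assumption made for illustration: the weights, the saturation points, and the field names are hypothetical and are not PG Monitoring's actual model.

```python
from dataclasses import dataclass


@dataclass
class ReplicationSample:
    write_lag_s: float
    flush_lag_s: float
    replay_lag_s: float
    lag_slope: float            # seconds of lag gained per second
    slot_retained_bytes: int    # WAL held back by inactive slots
    conflicts_per_min: float    # standby queries cancelled per minute


def _scaled(value, saturation):
    """Map a raw value onto 0..1, capping at the saturation point."""
    return max(0.0, min(value / saturation, 1.0))


def risk_score(s):
    """Weighted sum of scaled factors, returned as an integer 0..100.

    Weights and saturation points are hypothetical and sum to 1.0.
    """
    parts = [
        (0.15, _scaled(s.write_lag_s, 30.0)),
        (0.15, _scaled(s.flush_lag_s, 30.0)),
        (0.25, _scaled(s.replay_lag_s, 60.0)),
        (0.20, _scaled(s.lag_slope, 0.5)),
        (0.15, _scaled(s.slot_retained_bytes, 50 * 1024**3)),
        (0.10, _scaled(s.conflicts_per_min, 10.0)),
    ]
    return round(100 * sum(weight * value for weight, value in parts))
```

Capping each factor keeps one runaway metric from drowning out the others, while still letting several moderate signals add up to a high score.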

Real-World Example

A financial services company had daily replication "hiccups" at 2 AM. Traditional monitoring showed "lag spikes" but couldn't explain why.

PG Monitoring identified:

  1. Write lag normal (network OK)
  2. Flush lag spiking (standby disk I/O saturated)
  3. Root cause: Nightly backup job on standby server
  4. Solution: Adjust backup window, add IOPS

Result: Replication lag went from 45-second spikes to consistently under 2 seconds.
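The diagnosis above rests on splitting lag by stage. In pg_stat_replication the three lag columns are cumulative (replay_lag includes flush_lag, which includes write_lag), so subtracting neighbours isolates each stage. The sketch below assumes the lags have already been converted to seconds; the function name and the sample figures are hypothetical.

```python
def dominant_stage(write_lag_s, flush_lag_s, replay_lag_s):
    """Split cumulative lag into per-stage delays and name the largest.

    Returns (stage_name, per_stage_seconds). Negative differences, which
    can appear when samples are taken at slightly different moments,
    are clamped to zero.
    """
    stages = {
        "write": write_lag_s,                             # network / receive
        "flush": max(flush_lag_s - write_lag_s, 0.0),     # standby disk
        "replay": max(replay_lag_s - flush_lag_s, 0.0),   # WAL apply
    }
    return max(stages, key=stages.get), stages


# Hypothetical 2 AM reading: network is fine, the standby's disk is not
stage, breakdown = dominant_stage(0.2, 40.0, 45.0)
```

Here `stage` comes out as "flush", which points at standby disk I/O rather than the network or the apply process.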

Feature Comparison

Feature                                  Native PG   pgWatch2   PG Monitoring
Basic lag visibility
Lag decomposition (write/flush/replay)
Predictive risk score
Slot bloat detection                     Manual      Partial
Conflict analysis
Config drift detection
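For a sense of what "Manual" slot bloat detection means with native PostgreSQL, here is a minimal sketch. It assumes you run the query below on the primary yourself and pass the LSN strings to the helper functions; the function names are hypothetical. Server-side, pg_wal_lsn_diff() performs the same subtraction.

```python
# Run on the primary; restart_lsn is the oldest WAL position a slot still needs.
SLOT_QUERY = """
SELECT slot_name, active, restart_lsn, pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots;
"""


def lsn_to_int(lsn):
    """Convert an LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_retained_bytes(current_lsn, restart_lsn):
    """Bytes of WAL the primary must keep for this slot.

    restart_lsn can be NULL (None) for a slot that has never been used.
    """
    if restart_lsn is None:
        return 0
    return max(lsn_to_int(current_lsn) - lsn_to_int(restart_lsn), 0)
```

A slot with active = false and a retained size that keeps growing is the pattern to watch for, since that WAL cannot be recycled until the slot advances or is dropped.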

The Bottom Line

Traditional replication monitoring tells you "lag is 30 seconds." PG Monitoring tells you "replay lag is increasing because standby disk I/O is saturated due to backup job conflict. Fix backup window or add IOPS. Risk score: 78/100."

That's the difference between reactive and predictive monitoring.

Ready to experience better PostgreSQL monitoring?

Join thousands of teams who switched from traditional tools to PG Monitoring's AI-powered platform.