The Replication Lag Problem
PostgreSQL replication lag is a silent killer. By the time your monitoring tool shows lag > 30 seconds, your business is already suffering. The native pg_stat_replication view, and tools built directly on it, only show the current state. On their own they can't tell you:
- Whether lag is trending up or down
- Which stage (write, flush, or replay) is the bottleneck, unless you compare the write_lag, flush_lag, and replay_lag columns yourself
- How long until replication completely fails
- If a standby will recover or needs intervention
What Other Tools Give You
pg_stat_replication (Native PostgreSQL)
SELECT client_addr, state, replay_lag
FROM pg_stat_replication;
This gives you a snapshot. You see that replay_lag is "00:00:15.234". Is that good? Bad? Getting worse? A single reading can't tell you. You have to run the query repeatedly and track the trend yourself.
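To make "track the trend yourself" concrete, here is a minimal sketch of what that involves. It assumes you already collect (timestamp, replay lag in seconds) pairs by polling the view. The function names, the sample values, and the straight-line projection are illustrative assumptions, not part of any tool described here.

```python
def lag_slope(samples):
    """Least-squares slope: seconds of lag gained per second of wall-clock time.

    samples: list of (timestamp_seconds, lag_seconds) pairs, oldest first.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    cov = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    return cov / var


def seconds_until(samples, threshold):
    """Linear projection of when lag reaches `threshold` seconds.

    Returns 0.0 if the latest sample is already at or past the threshold,
    and None if lag is flat or shrinking.
    """
    current = samples[-1][1]
    if current >= threshold:
        return 0.0
    slope = lag_slope(samples)
    if slope <= 0:
        return None
    return (threshold - current) / slope


# Hypothetical readings taken every 10 seconds: (time, replay lag in seconds)
samples = [(0, 5.0), (10, 7.0), (20, 9.0), (30, 11.0)]
print(lag_slope(samples), seconds_until(samples, 30.0))
```

A straight-line fit is the simplest possible choice. Real lag is bursty, so treat the projection as a rough early-warning signal rather than a forecast.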
pgWatch2 / Grafana Dashboards
These tools add visualization, but they're still reactive. They show you that lag has exceeded a threshold. Out of the box, they don't predict when it will. By the time the alert fires, your standby is already struggling.
PG Monitoring's Approach: Predictive Risk Scoring
PG Monitoring uses a multi-factor risk model that combines:
- Write lag - How long until the standby has received and written WAL
- Flush lag - How long until the standby has flushed that WAL to disk
- Replay lag - How long until the standby has actually applied the changes
- Lag trend - Is lag accelerating?
- Slot bloat - Inactive slots retaining WAL
- Conflict counts - Queries being cancelled
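To illustrate how factors like these might be folded into a single number, here is a rough sketch of a weighted score. Everything in it is an assumption made for illustration: the weights, the saturation points, and the field names are hypothetical and do not describe PG Monitoring's actual model.

```python
from dataclasses import dataclass


@dataclass
class StandbyMetrics:
    write_lag_s: float
    flush_lag_s: float
    replay_lag_s: float
    lag_slope: float          # seconds of lag gained per second
    retained_wal_bytes: int   # WAL held back by inactive slots
    conflicts_per_min: float  # queries cancelled on the standby


# Hypothetical (weight, worst_case) pairs. The weights sum to 100, and each
# factor saturates once it reaches its worst_case value.
FACTORS = {
    "write_lag_s": (15, 30.0),
    "flush_lag_s": (15, 30.0),
    "replay_lag_s": (25, 60.0),
    "lag_slope": (20, 0.5),
    "retained_wal_bytes": (15, 10 * 1024**3),
    "conflicts_per_min": (10, 5.0),
}


def _scale(value, worst):
    """Map a raw value onto 0..1, clamping negatives to 0 and saturating at 1."""
    return max(0.0, min(value / worst, 1.0))


def risk_score(metrics):
    """Weighted sum of the scaled factors, rounded to an integer from 0 to 100."""
    total = 0.0
    for name, (weight, worst) in FACTORS.items():
        total += weight * _scale(getattr(metrics, name), worst)
    return round(total)


# Hypothetical standby: healthy network, struggling disk, lag slowly rising.
example = StandbyMetrics(0.3, 30.0, 45.0, 0.1, 0, 0.0)
print(risk_score(example))
```

A weighted sum keeps each factor's contribution easy to explain, which matters when the score has to justify an alert.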
Real-World Example
A financial services company had daily replication "hiccups" at 2 AM. Traditional monitoring showed "lag spikes" but couldn't explain why.
PG Monitoring identified:
- Write lag normal (network OK)
- Flush lag spiking (standby disk I/O saturated)
- Root cause: Nightly backup job on standby server
- Solution: Adjust backup window, add IOPS
Result: Replication lag went from 45-second spikes to consistent < 2 seconds.
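The reasoning in this example can be sketched in a few lines. In pg_stat_replication the three lag columns are cumulative: each one measures the time until the standby has written, then flushed, then applied the same WAL. Subtracting neighbours therefore approximates the time spent in each stage. The function name, stage labels, and sample numbers below are hypothetical.

```python
def dominant_stage(write_lag, flush_lag, replay_lag):
    """Split cumulative lag (in seconds) into per-stage time; return the largest.

    Returns (stage_name, {stage_name: seconds_spent_in_that_stage}).
    """
    stages = {
        "write (network/transfer)": write_lag,
        # Differences are clamped at zero because the three values are
        # sampled independently and can briefly appear out of order.
        "flush (standby disk)": max(flush_lag - write_lag, 0.0),
        "replay (apply on standby)": max(replay_lag - flush_lag, 0.0),
    }
    name = max(stages, key=stages.get)
    return name, stages


# Hypothetical reading during a spike like the one described above.
stage, breakdown = dominant_stage(write_lag=0.3, flush_lag=40.0, replay_lag=45.0)
print(stage, breakdown)
```

With these made-up numbers nearly all of the delay falls between write and flush. That is the pattern you would expect when something else on the standby, such as a backup job, is competing for disk I/O.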
Feature Comparison
| Feature | Native PG | pgWatch2 | PG Monitoring |
|---|---|---|---|
| Basic lag visibility | ✓ | ✓ | ✓ |
| Automated bottleneck attribution (write/flush/replay) | Manual | ✗ | ✓ |
| Predictive risk score | ✗ | ✗ | ✓ |
| Slot bloat detection | Manual | Partial | ✓ |
| Conflict analysis | ✗ | ✗ | ✓ |
| Config drift detection | ✗ | ✗ | ✓ |
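For a sense of what the "Manual" slot check involves, here is a small sketch. It assumes you have already fetched slot_name, active, and restart_lsn from pg_replication_slots, plus the primary's current WAL position from pg_current_wal_lsn(). The helper names and the 1 GiB limit are hypothetical. The LSN arithmetic computes the same byte difference that pg_wal_lsn_diff() gives you in SQL.

```python
def lsn_to_bytes(lsn):
    """Convert an LSN string such as '16/B374D848' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL the primary must keep for a slot at restart_lsn."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)


def bloated_slots(slots, current_lsn, limit_bytes):
    """Return (slot_name, retained_bytes) for inactive slots over the limit.

    slots: iterable of (slot_name, active, restart_lsn) tuples.
    """
    flagged = []
    for name, active, restart_lsn in slots:
        if active or restart_lsn is None:
            continue  # an active slot, or one that holds no WAL yet
        retained = retained_wal_bytes(current_lsn, restart_lsn)
        if retained > limit_bytes:
            flagged.append((name, retained))
    return flagged


# Hypothetical slots and a hypothetical 1 GiB limit.
slots = [
    ("standby_a", True, "2/F0000000"),
    ("old_replica", False, "1/0"),
    ("unused", False, None),
]
print(bloated_slots(slots, current_lsn="3/0", limit_bytes=1024**3))
```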
The Bottom Line
Traditional replication monitoring tells you "lag is 30 seconds." PG Monitoring tells you "replay lag is increasing because standby disk I/O is saturated by a competing backup job. Move the backup window or add IOPS. Risk score: 78/100."
That's the difference between reactive and predictive monitoring.