How We Solved Replication Lag Detection: Beyond Basic Monitoring

The Replication Lag Problem

PostgreSQL replication lag is a silent killer. By the time your monitoring tool shows lag above 30 seconds, your business is already suffering. Native views like pg_stat_replication only show the current state. On their own, they can't tell you:

Whether lag is trending up or down
Which specific bottleneck (write, flush, or replay) is causing lag
How long until replication completely fails
Whether a standby will recover on its own or needs intervention

What Other Tools Give You

pg_stat_replication (Native PostgreSQL)

SELECT client_addr, state, replay_lag 
FROM pg_stat_replication;

This gives you a snapshot. You see that replay_lag is "00:00:15.234". Is that good? Bad? Getting worse? You don't know. You need to run this query repeatedly and track trends yourself.
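Tracking the trend yourself can be as simple as fitting a line through recent readings. The following is a minimal sketch, assuming you already store each replay_lag reading as a (timestamp in seconds, lag in seconds) pair. The function name `lag_trend` and the sample values are hypothetical, chosen only for illustration.

```python
from statistics import mean


def lag_trend(samples):
    """Return the least-squares slope of lag over time.

    samples: list of (timestamp_seconds, lag_seconds) pairs, oldest first.
    The result is in seconds of lag gained per second of wall-clock time:
    positive means lag is growing, negative means the standby is catching up.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a trend")
    times = [t for t, _ in samples]
    lags = [lag for _, lag in samples]
    t_mean, lag_mean = mean(times), mean(lags)
    denom = sum((t - t_mean) ** 2 for t in times)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    numer = sum((t - t_mean) * (lag - lag_mean) for t, lag in samples)
    return numer / denom


# Hypothetical readings taken every 10 seconds; lag grows 1.5 s per reading.
samples = [(0, 15.0), (10, 16.5), (20, 18.0), (30, 19.5)]
slope = lag_trend(samples)
```

A slope near zero with a 15-second lag suggests a stable standby. The same 15 seconds with a clearly positive slope is a different situation.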

pgWatch2 / Grafana Dashboards

These tools add visualization, but they're still reactive. They show you that lag has exceeded a threshold, not when it is going to. By the time the alert fires, your standby is already struggling.
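To illustrate what predicting a threshold breach can look like, here is a rough sketch that extrapolates linearly from the current lag and its growth rate. It assumes lag keeps growing at a constant rate, which real workloads rarely do, so treat the result as an early warning rather than a forecast. The function name and the numbers are hypothetical.

```python
def seconds_until_threshold(current_lag, growth_rate, threshold):
    """Linearly project how long until lag reaches the alert threshold.

    current_lag: latest lag reading, in seconds.
    growth_rate: seconds of lag gained per second of wall-clock time.
    threshold:   lag, in seconds, at which an alert would fire.

    Returns 0.0 if the threshold is already breached, and None if lag is
    flat or shrinking (no breach projected).
    """
    if current_lag >= threshold:
        return 0.0
    if growth_rate <= 0:
        return None
    return (threshold - current_lag) / growth_rate


# Hypothetical case: 15 s of lag, growing 0.25 s per second, 30 s threshold.
eta = seconds_until_threshold(current_lag=15.0, growth_rate=0.25, threshold=30.0)
```

With these inputs the projection gives one minute of warning before a 30-second alert would fire.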

PG Monitoring's Approach: Predictive Risk Scoring

PG Monitoring uses a multi-factor risk model that combines:

Write lag - How far behind the standby is in receiving WAL
Flush lag - How long the standby takes to flush received WAL to disk
Replay lag - How long until changes are actually applied on the standby
Lag trend - Whether lag is accelerating
Slot bloat - Inactive slots retaining WAL
Conflict counts - Standby queries cancelled by recovery conflicts
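The article does not publish the scoring formula, so the following is only a sketch of how a multi-factor score of this kind could be assembled: each factor is normalised against a saturation value, weighted, and summed onto a 0-100 scale. Every weight, saturation value, and name below is an assumption made for illustration, not PG Monitoring's actual model.

```python
from dataclasses import dataclass


@dataclass
class ReplicationSample:
    write_lag_s: float
    flush_lag_s: float
    replay_lag_s: float
    lag_slope: float            # seconds of lag gained per second
    slot_retained_bytes: float  # WAL held back by inactive slots
    conflicts_per_hour: float   # standby queries cancelled per hour


# Hypothetical (weight, saturation) pairs. Weights sum to 1.0; a factor at or
# above its saturation value contributes its full weight.
FACTORS = {
    "write_lag_s": (0.15, 30.0),
    "flush_lag_s": (0.15, 30.0),
    "replay_lag_s": (0.25, 30.0),
    "lag_slope": (0.20, 0.5),
    "slot_retained_bytes": (0.15, 10 * 1024**3),
    "conflicts_per_hour": (0.10, 20.0),
}


def risk_score(sample):
    """Combine the six factors into a 0-100 score (higher means riskier)."""
    score = 0.0
    for name, (weight, saturation) in FACTORS.items():
        value = max(0.0, getattr(sample, name))  # negative slope adds no risk
        score += weight * min(value / saturation, 1.0)
    return round(100 * score)


healthy = ReplicationSample(0.1, 0.2, 0.5, 0.0, 0, 0)
degraded = ReplicationSample(0.2, 25.0, 28.0, 0.3, 4 * 1024**3, 5)
```

Capping each factor at its saturation value keeps one runaway metric from hiding the others, which is one reasonable design choice among several.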

Real-World Example

A financial services company had daily replication "hiccups" at 2 AM. Traditional monitoring showed "lag spikes" but couldn't explain why.

PG Monitoring identified:

Write lag normal (network OK)
Flush lag spiking (standby disk I/O saturated)
Root cause: Nightly backup job on standby server
Solution: Adjust backup window, add IOPS

Result: Replication lag went from 45-second spikes to consistently under 2 seconds.
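The reasoning in this diagnosis can be sketched in a few lines. In pg_stat_replication the write_lag, flush_lag, and replay_lag columns are cumulative (flush includes write, replay includes flush), so subtracting adjacent values isolates the time spent in each stage. The function name, stage labels, and readings below are hypothetical, and the sketch assumes the three values have already been converted to seconds.

```python
def classify_bottleneck(write_lag_s, flush_lag_s, replay_lag_s):
    """Return the replication stage that contributes the most delay.

    The inputs are cumulative lags in seconds, so the per-stage cost is the
    difference between adjacent values.
    """
    stages = {
        "write (network or WAL receive)": write_lag_s,
        "flush (standby disk I/O)": max(0.0, flush_lag_s - write_lag_s),
        "replay (WAL apply on standby)": max(0.0, replay_lag_s - flush_lag_s),
    }
    return max(stages, key=stages.get)


# Hypothetical readings resembling the 2 AM spike: write is normal, flush jumps.
bottleneck = classify_bottleneck(write_lag_s=0.2, flush_lag_s=40.0, replay_lag_s=45.0)
```

Here the flush stage accounts for almost all of the delay, which points at the standby's disk rather than the network.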

Feature Comparison

Feature	Native PG	pgWatch2	PG Monitoring
Basic lag visibility	✓	✓	✓
Lag decomposition (write/flush/replay)	Manual	✗	✓
Predictive risk score	✗	✗	✓
Slot bloat detection	Manual	Partial	✓
Conflict analysis	✗	✗	✓
Config drift detection	✗	✗	✓

The Bottom Line

Traditional replication monitoring tells you "lag is 30 seconds." PG Monitoring tells you "replay lag is increasing because standby disk I/O is saturated due to backup job conflict. Fix backup window or add IOPS. Risk score: 78/100."

That's the difference between reactive and predictive monitoring.
