The Problem with Threshold-Based Alerting
Traditional monitoring uses static thresholds:
- "Alert if CPU > 80%"
- "Alert if connections > 90%"
- "Alert if replication lag > 30s"
These alerts have two fatal flaws:
- False positives: 80% CPU during a nightly batch job is perfectly normal, yet it pages someone
- False negatives: 40% CPU when the normal level is 15% signals a real problem, yet no alert fires
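Both flaws can be seen in a few lines of code. This is a minimal illustrative sketch (the numbers mirror the examples above, and the function names are ours, not any product's API): a static threshold stays silent at 40% CPU, while a check against a learned baseline flags it.

```python
# Static threshold vs. learned-baseline alerting (illustrative only).

STATIC_CPU_THRESHOLD = 80.0  # "Alert if CPU > 80%"

def static_alert(cpu: float) -> bool:
    # Fires only when the absolute threshold is crossed.
    return cpu > STATIC_CPU_THRESHOLD

def baseline_alert(cpu: float, mean: float, stddev: float, sigmas: float = 3.0) -> bool:
    # Fires when the reading deviates more than `sigmas` standard
    # deviations from the learned mean for this host and time window.
    return abs(cpu - mean) > sigmas * stddev

# Normal CPU for this host at this hour is 15% +/- 5%.
reading = 40.0
print(static_alert(reading))               # False - the silent false negative
print(baseline_alert(reading, 15.0, 5.0))  # True  - a 5-sigma deviation
```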
How Other Tools Handle Anomaly Detection
Datadog / New Relic APM
Both offer PostgreSQL plugins with "baseline" anomaly detection:
- Learns patterns over days/weeks
- Alerts on deviations from baseline
- Requires external SaaS (data leaves premises)
- Generic - not PostgreSQL-specific
Cost: $70-150/server/month
TimescaleDB / Prometheus + Grafana
Build your own with recording rules:
- Stores time-series data
- Supports complex queries for anomaly detection
- Requires expertise to build ML models
- No built-in PostgreSQL understanding
Complexity: High. Most teams never implement effective ML alerting.
pgDash
Offers "intelligent alerting" but:
- Simple statistical methods only
- No cross-metric correlation
- Reactive, not predictive
PG Monitoring's ML-Powered Anomaly Detection
1. Multi-Dimensional Pattern Learning
Not just "is CPU high?" but "is this combination of metrics abnormal?"
Learned pattern for Tuesday 14:00:
- CPU: 45% ± 12%
- Active connections: 67 ± 8
- Cache hit ratio: 96% ± 2%
- Transaction rate: 1,200/sec ± 200
- Replication lag: 0.1s ± 0.05s
Current reading:
- CPU: 48% (normal)
- Connections: 69 (normal)
- Cache hit: 94% (normal)
- Replication lag: 2.3s (ABNORMAL - 44σ from the 0.1s ± 0.05s mean)
Alert: Replication lag anomaly detected
Root cause analysis: Standby disk I/O saturation
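The per-metric check above can be sketched as a z-score pass over the learned baseline. The numbers come from the Tuesday 14:00 example; the code itself is an illustrative assumption, not PG Monitoring's actual model:

```python
# Score each current reading against a (mean, stddev) baseline learned
# for this time slot; report metrics beyond a sigma cutoff.

baseline = {                       # learned for Tuesday 14:00
    "cpu_pct":           (45.0, 12.0),
    "active_connections": (67.0, 8.0),
    "cache_hit_pct":     (96.0, 2.0),
    "replication_lag_s": (0.1, 0.05),
}

current = {
    "cpu_pct":           48.0,
    "active_connections": 69.0,
    "cache_hit_pct":     94.0,
    "replication_lag_s":  2.3,
}

def anomalies(current: dict, baseline: dict, sigmas: float = 3.0) -> dict:
    out = {}
    for metric, value in current.items():
        mean, std = baseline[metric]
        z = (value - mean) / std
        if abs(z) > sigmas:
            out[metric] = round(z, 1)
    return out

print(anomalies(current, baseline))  # {'replication_lag_s': 44.0}
```

CPU, connections, and cache hit ratio all land well inside one standard deviation, so only the replication lag surfaces.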
2. Leading Indicator Detection
Find metrics that predict problems before they happen:
Historical analysis shows:
- Checkpoint write time increase → 15min later: query slowdown
- Dead tuple ratio growth → 45min later: autovacuum storm
- Connection wait queue growth → 5min later: connection errors
Predictive alert triggered:
"Checkpoint write time up 300% - expect query performance
degradation in ~15 minutes. Investigate disk I/O now."
3. Seasonal Pattern Recognition
Weekday 9 AM spike is normal. Sunday 3 AM spike is not.
Learned daily patterns:
- 09:00: Login spike (80% above baseline) - EXPECTED
- 14:00: Analytics batch job - EXPECTED
- 03:00 Sunday: Connection spike - UNEXPECTED
Anomaly score: 94/100
Investigation: Unauthorized API access detected
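One simple way to encode seasonality is to learn a separate baseline per (weekday, hour) bucket, so a Monday 09:00 spike scores low while the same reading on Sunday 03:00 scores high. This is an illustrative sketch with made-up numbers and a crude 0-100 scale of our own, not the product's scoring formula:

```python
# Per-(weekday, hour) baselines for seasonal anomaly scoring.

from statistics import mean, stdev

def build_baseline(samples):
    # samples: list of ((weekday, hour), connection_count) observations.
    buckets = {}
    for key, value in samples:
        buckets.setdefault(key, []).append(value)
    return {k: (mean(v), stdev(v)) for k, v in buckets.items()}

def anomaly_score(baseline, key, value):
    m, s = baseline[key]
    z = abs(value - m) / s
    return min(100, round(z * 10))  # crude 0-100 scale (an assumption)

history = (
    [((0, 9), 400 + d) for d in (-20, -5, 0, 10, 15)]  # Mondays 09:00: busy
    + [((6, 3), 12 + d) for d in (-3, -1, 0, 1, 2)]    # Sundays 03:00: quiet
)
baseline = build_baseline(history)

print(anomaly_score(baseline, (0, 9), 410))  # Monday spike: low score
print(anomaly_score(baseline, (6, 3), 80))   # Sunday 03:00 spike: maxed out
```

The same absolute connection count is "expected" in one bucket and a near-certain anomaly in the other, which is exactly the distinction static thresholds cannot make.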
Case Study: Predicting the Unpredictable
A retail customer's database would mysteriously slow down every few weeks.
Traditional monitoring showed high CPU and connection errors only once the database was already slow.
PG Monitoring anomaly detection found:
Pattern detected 3 weeks before each incident:
- Dead tuple ratio grew 5% daily for 7 days
- Autovacuum couldn't keep up
- Table bloat crossed 40% threshold
- Query plans changed from index to seq scan
- Performance collapsed
Root cause: Long-running transaction preventing vacuum
(predictable, but threshold alerts never caught the pattern)
Fix: Added monitoring for long-running transactions; the system now auto-terminates any transaction open longer than 2 hours.
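A fix along those lines can be expressed with standard PostgreSQL catalog views. This is a hedged sketch, not the customer's actual job: it uses the documented `pg_stat_activity` columns and `pg_terminate_backend()`, and would be run from whatever scheduler you already have (cron, pg_cron, etc.) under a role with permission to terminate backends:

```python
# SQL to find and terminate backends whose transaction has been open
# for more than two hours. How you execute it (psycopg, a cron'd psql
# call, etc.) is up to your environment.

TERMINATE_LONG_TRANSACTIONS = """
SELECT pid,
       now() - xact_start AS xact_age,
       pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '2 hours'
  AND pid <> pg_backend_pid();   -- never kill our own session
"""
```

Terminating sessions is disruptive, so in practice you would alert on transactions approaching the limit first and reserve automatic termination for a generous cutoff like the 2 hours above.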
Feature Comparison
| Feature | Static Thresholds | Datadog | TimescaleDB DIY | PG Monitoring |
|---|---|---|---|---|
| Baseline learning | ✗ | ✓ | Build yourself | ✓ |
| Multi-metric correlation | ✗ | Partial | Build yourself | ✓ |
| Seasonal pattern awareness | ✗ | ✓ | Build yourself | ✓ |
| PostgreSQL-specific models | ✗ | ✗ | ✗ | ✓ |
| Predictive alerts (leading indicators) | ✗ | ✗ | ✗ | ✓ |
| On-premise / self-hosted | ✓ | ✗ | ✓ | ✓ |
| Cost per server | Free | $70-150/mo | Infrastructure | Included |
The Bottom Line
Traditional alert: "CPU is 85% (threshold exceeded)"
Datadog alert: "CPU is 3σ above baseline for Tuesday 2 PM"
PG Monitoring alert: "Replication lag pattern suggests standby disk saturation in ~20 minutes based on current I/O trends and historical correlation. Recommend investigating disk queue depth."