Predictive Anomaly Detection: Before Problems Happen

PG Monitoring Team · April 03, 2026 · 7 min read

The Problem with Threshold-Based Alerting

Traditional monitoring uses static thresholds:

  • "Alert if CPU > 80%"
  • "Alert if connections > 90%"
  • "Alert if replication lag > 30s"

These alerts have two fatal flaws:

  1. False positives: 80% CPU during a nightly batch job is perfectly normal, yet it alerts
  2. False negatives: 40% CPU on a server whose norm is 15% signals trouble, yet it stays silent
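The gap between the two failure modes is easy to see in code. Here is a minimal sketch (the CPU figures are invented for illustration) comparing a static threshold with a per-host learned baseline:

```python
import statistics

def static_alert(cpu, threshold=80):
    """Traditional rule: alert only when CPU crosses a fixed line."""
    return cpu > threshold

def baseline_alert(cpu, history, sigmas=3):
    """Baseline rule: alert when CPU deviates from its own learned norm."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(cpu - mean) > sigmas * stdev

# Nightly batch host: 80%+ CPU is routine here
batch_history = [78, 82, 85, 79, 81, 84, 80, 83]
print(static_alert(85))                   # True  - false positive
print(baseline_alert(85, batch_history))  # False - within normal range

# Quiet service: 40% CPU is nearly triple the norm
quiet_history = [14, 15, 16, 15, 14, 16, 15, 14]
print(static_alert(40))                   # False - false negative
print(baseline_alert(40, quiet_history))  # True  - anomaly caught
```

The same reading can be fine on one host and alarming on another; only the baseline rule captures that.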

How Other Tools Handle Anomaly Detection

Datadog / New Relic APM

Add PostgreSQL plugins with "baseline" detection:

  • Learns patterns over days/weeks
  • Alerts on deviations from baseline
  • Requires external SaaS (data leaves premises)
  • Generic - not PostgreSQL-specific

Cost: $70-150/server/month

TimescaleDB / Prometheus + Grafana

Build your own with recording rules:

  • Stores time-series data
  • Supports complex queries for anomaly detection
  • Requires expertise to build ML models
  • No built-in PostgreSQL understanding

Complexity: High. Most teams never implement effective ML alerting.

pgDash

Offers "intelligent alerting" but:

  • Simple statistical methods only
  • No cross-metric correlation
  • Reactive, not predictive

PG Monitoring's ML-Powered Anomaly Detection

1. Multi-Dimensional Pattern Learning

Not just "is CPU high?" but "is this combination of metrics abnormal?"

Learned pattern for Tuesday 14:00:
- CPU: 45% ± 12%
- Active connections: 67 ± 8
- Cache hit ratio: 96% ± 2%
- Transaction rate: 1,200/sec ± 200
- Replication lag: 0.1s ± 0.05s

Current reading:
- CPU: 48% (normal)
- Connections: 69 (normal)
- Cache hit: 94% (normal)
- Replication lag: 2.3s (ABNORMAL - 44σ from mean)

Alert: Replication lag anomaly detected
Root cause analysis: Standby disk I/O saturation
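The per-metric scoring step can be pictured as a simple sketch (the metric names and baseline figures just mirror the example above; a real model would go beyond independent z-scores):

```python
# Learned baseline for one time slot: {metric: (mean, stdev)}
baseline = {
    "cpu_pct":         (45.0, 12.0),
    "connections":     (67.0, 8.0),
    "cache_hit_pct":   (96.0, 2.0),
    "txn_per_sec":     (1200.0, 200.0),
    "replication_lag": (0.1, 0.05),
}

def score_reading(reading, baseline, sigmas=3):
    """Return {metric: z-score} for metrics beyond `sigmas` from the slot mean."""
    anomalies = {}
    for metric, value in reading.items():
        mean, stdev = baseline[metric]
        z = (value - mean) / stdev
        if abs(z) > sigmas:
            anomalies[metric] = round(z, 1)
    return anomalies

current = {"cpu_pct": 48, "connections": 69, "cache_hit_pct": 94,
           "txn_per_sec": 1250, "replication_lag": 2.3}
print(score_reading(current, baseline))   # {'replication_lag': 44.0}
```

Every metric sits comfortably inside its band except replication lag, which is dozens of standard deviations out, so that single metric drives the alert.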

2. Leading Indicator Detection

Find metrics that predict problems before they happen:

Historical analysis shows:
- Checkpoint write time increase → 15min later: query slowdown
- Dead tuple ratio growth → 45min later: autovacuum storm
- Connection wait queue growth → 5min later: connection errors

Predictive alert triggered:
"Checkpoint write time up 300% - expect query performance 
degradation in ~15 minutes. Investigate disk I/O now."
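Leading indicators like these can be surfaced by correlating one metric against a time-shifted copy of another. A toy sketch with synthetic series (all values invented; real detection would scan many lags and metric pairs):

```python
def lagged_correlation(leading, target, lag):
    """Pearson correlation between leading[t] and target[t + lag]."""
    x = leading[:len(leading) - lag]
    y = target[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Synthetic data: query latency echoes checkpoint write time 3 samples later
checkpoint_ms = [10, 10, 11, 40, 42, 41, 12, 10, 11, 38, 40, 12]
latency_ms    = [5, 5, 5, 5, 5, 6, 20, 21, 20, 6, 5, 6]

print(lagged_correlation(checkpoint_ms, latency_ms, 0))  # low or negative
print(lagged_correlation(checkpoint_ms, latency_ms, 3))  # close to 1.0
```

A pair that correlates weakly at lag 0 but strongly at some positive lag is exactly what makes a metric a leading indicator: it buys you that many samples of warning.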

3. Seasonal Pattern Recognition

Weekday 9 AM spike is normal. Sunday 3 AM spike is not.

Learned daily patterns:
- 09:00: Login spike (80% above baseline) - EXPECTED
- 14:00: Analytics batch job - EXPECTED  
- 03:00 Sunday: Connection spike - UNEXPECTED

Anomaly score: 94/100
Investigation: Unauthorized API access detected
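Seasonality can be handled by learning a separate baseline per calendar slot. A toy version keyed on (weekday, hour), with invented connection counts:

```python
from collections import defaultdict
import statistics

class SeasonalBaseline:
    """Learn per-(weekday, hour) baselines; score new readings against them."""

    def __init__(self):
        self.samples = defaultdict(list)

    def observe(self, weekday, hour, value):
        self.samples[(weekday, hour)].append(value)

    def anomaly_score(self, weekday, hour, value):
        history = self.samples[(weekday, hour)]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history) or 1.0   # guard against zero spread
        return abs(value - mean) / stdev

b = SeasonalBaseline()
for week in range(8):                    # 8 weeks of history
    b.observe("Mon", 9, 180 + week)      # weekday 9 AM login spike
    b.observe("Sun", 3, 12 + week % 3)   # quiet Sunday nights

print(b.anomaly_score("Mon", 9, 185))    # small: a spike is expected in this slot
print(b.anomaly_score("Sun", 3, 140))    # large: the same load at 3 AM Sunday is anomalous
```

The same absolute connection count scores near zero on Monday morning and off the charts on Sunday night, which is the property a fixed threshold can never express.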

Case Study: Predicting the Unpredictable

A retail customer's database would mysteriously slow down every few weeks.

Traditional monitoring: Showed high CPU and connection errors only once the database was already slow.

PG Monitoring anomaly detection found:

Pattern detected 3 weeks before each incident:
- Dead tuple ratio grew 5% daily for 7 days
- Autovacuum couldn't keep up
- Table bloat crossed 40% threshold
- Query plans changed from index to seq scan
- Performance collapsed

Root cause: Long-running transaction preventing vacuum
(predictable, but threshold alerts never caught the pattern)

Fix: Added monitoring for long-running transactions; transactions open longer than 2 hours are now terminated automatically.
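The decision step behind that fix is simple. A minimal sketch (the 2-hour cutoff follows the fix above; in production the pid list would come from PostgreSQL's pg_stat_activity view, and each flagged pid would be passed to pg_terminate_backend):

```python
from datetime import datetime, timedelta

MAX_XACT_AGE = timedelta(hours=2)

# In production the input would come from a query such as:
#   SELECT pid, xact_start FROM pg_stat_activity
#   WHERE xact_start IS NOT NULL AND state <> 'idle';
def pids_to_terminate(transactions, now, max_age=MAX_XACT_AGE):
    """Given (pid, xact_start) pairs, return pids open longer than max_age."""
    return [pid for pid, started in transactions if now - started > max_age]

now = datetime(2026, 4, 3, 12, 0)
xacts = [
    (101, now - timedelta(minutes=30)),          # fine
    (102, now - timedelta(hours=3)),             # vacuum blocker - terminate
    (103, now - timedelta(hours=2, minutes=5)),  # just over the line
]
print(pids_to_terminate(xacts, now))   # [102, 103]
```

Each returned pid would then be handed to `pg_terminate_backend(pid)`, freeing the oldest transaction horizon so vacuum can reclaim dead tuples again.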

Feature Comparison

| Feature                                | Static Thresholds | Datadog    | TimescaleDB DIY | PG Monitoring |
|----------------------------------------|-------------------|------------|-----------------|---------------|
| Baseline learning                      | ✗                 | ✓          | Build yourself  | ✓             |
| Multi-metric correlation               | ✗                 | Partial    | Build yourself  | ✓             |
| Seasonal pattern awareness             | ✗                 | ✓          | Build yourself  | ✓             |
| PostgreSQL-specific models             | ✗                 | ✗          | ✗               | ✓             |
| Predictive alerts (leading indicators) | ✗                 | ✗          | ✗               | ✓             |
| On-premise / self-hosted               | ✓                 | ✗          | ✓               | ✓             |
| Cost per server                        | Free              | $70-150/mo | Infrastructure  | Included      |

The Bottom Line

Traditional alert: "CPU is 85% (threshold exceeded)"

Datadog alert: "CPU is 3σ above baseline for Tuesday 2 PM"

PG Monitoring alert: "Replication lag pattern suggests standby disk saturation in ~20 minutes based on current I/O trends and historical correlation. Recommend investigating disk queue depth."

Ready to experience better PostgreSQL monitoring?

Join thousands of teams who switched from traditional tools to PG Monitoring's AI-powered platform.

Talk to us