Predictive Anomaly Detection: Before Problems Happen

PG Monitoring Team · April 03, 2026 · 7 min read

The Problem with Threshold-Based Alerting

Traditional monitoring uses static thresholds:

  • "Alert if CPU > 80%"
  • "Alert if connections > 90%"
  • "Alert if replication lag > 30s"

These alerts have two fatal flaws:

  1. False positives: 80% CPU during a nightly batch job is perfectly normal, yet it alerts
  2. False negatives: 40% CPU on a server whose norm is 15% signals trouble, yet it stays silent
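The gap between the two failure modes is easy to see in code. Here is a minimal sketch (the CPU figures are invented for illustration) comparing a static threshold with a per-host learned baseline:

```python
import statistics

def static_alert(cpu, threshold=80):
    """Traditional rule: alert only when CPU crosses a fixed line."""
    return cpu > threshold

def baseline_alert(cpu, history, sigmas=3):
    """Baseline rule: alert when CPU deviates from its own learned norm."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(cpu - mean) > sigmas * stdev

# Nightly batch host: 80%+ CPU is routine here
batch_history = [78, 82, 85, 79, 81, 84, 80, 83]
print(static_alert(85))                   # True  - false positive
print(baseline_alert(85, batch_history))  # False - within normal range

# Quiet service: 40% CPU is nearly triple the norm
quiet_history = [14, 15, 16, 15, 14, 16, 15, 14]
print(static_alert(40))                   # False - false negative
print(baseline_alert(40, quiet_history))  # True  - anomaly caught
```

The same reading can be fine on one host and alarming on another; only the baseline rule captures that.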

How Other Tools Handle Anomaly Detection

Datadog / New Relic APM

Add PostgreSQL plugins with "baseline" detection:

  • Learns patterns over days/weeks
  • Alerts on deviations from baseline
  • Requires external SaaS (data leaves premises)
  • Generic - not PostgreSQL-specific

Cost: $70-150/server/month

TimescaleDB / Prometheus + Grafana

Build your own with recording rules:

  • Stores time-series data
  • Supports complex queries for anomaly detection
  • Requires expertise to build ML models
  • No built-in PostgreSQL understanding

Complexity: High. Most teams never implement effective ML alerting.

pgDash

Offers "intelligent alerting" but:

  • Simple statistical methods only
  • No cross-metric correlation
  • Reactive, not predictive

PG Monitoring's ML-Powered Anomaly Detection

1. Multi-Dimensional Pattern Learning

Not just "is CPU high?" but "is this combination of metrics abnormal?"

Learned pattern for Tuesday 14:00:
- CPU: 45% ± 12%
- Active connections: 67 ± 8
- Cache hit ratio: 96% ± 2%
- Transaction rate: 1,200/sec ± 200
- Replication lag: 0.1s ± 0.05s

Current reading:
- CPU: 48% (normal)
- Connections: 69 (normal)
- Cache hit: 94% (normal)
- Replication lag: 2.3s (ABNORMAL - 44σ from mean)

Alert: Replication lag anomaly detected
Root cause analysis: Standby disk I/O saturation
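The per-metric scoring step can be pictured as a simple sketch (the metric names and baseline figures just mirror the example above; a real model would go beyond independent z-scores):

```python
# Learned baseline for one time slot: {metric: (mean, stdev)}
baseline = {
    "cpu_pct":         (45.0, 12.0),
    "connections":     (67.0, 8.0),
    "cache_hit_pct":   (96.0, 2.0),
    "txn_per_sec":     (1200.0, 200.0),
    "replication_lag": (0.1, 0.05),
}

def score_reading(reading, baseline, sigmas=3):
    """Return {metric: z-score} for metrics beyond `sigmas` from the slot mean."""
    anomalies = {}
    for metric, value in reading.items():
        mean, stdev = baseline[metric]
        z = (value - mean) / stdev
        if abs(z) > sigmas:
            anomalies[metric] = round(z, 1)
    return anomalies

current = {"cpu_pct": 48, "connections": 69, "cache_hit_pct": 94,
           "txn_per_sec": 1250, "replication_lag": 2.3}
print(score_reading(current, baseline))   # {'replication_lag': 44.0}
```

Every metric sits comfortably inside its band except replication lag, which is dozens of standard deviations out, so that single metric drives the alert.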

2. Leading Indicator Detection

Find metrics that predict problems before they happen:

Historical analysis shows:
- Checkpoint write time increase → 15min later: query slowdown
- Dead tuple ratio growth → 45min later: autovacuum storm
- Connection wait queue growth → 5min later: connection errors

Predictive alert triggered:
"Checkpoint write time up 300% - expect query performance 
degradation in ~15 minutes. Investigate disk I/O now."
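Leading indicators like these can be surfaced by correlating one metric against a time-shifted copy of another. A toy sketch with synthetic series (all values invented; real detection would scan many lags and metric pairs):

```python
def lagged_correlation(leading, target, lag):
    """Pearson correlation between leading[t] and target[t + lag]."""
    x = leading[:len(leading) - lag]
    y = target[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Synthetic data: query latency echoes checkpoint write time 3 samples later
checkpoint_ms = [10, 10, 11, 40, 42, 41, 12, 10, 11, 38, 40, 12]
latency_ms    = [5, 5, 5, 5, 5, 6, 20, 21, 20, 6, 5, 6]

print(lagged_correlation(checkpoint_ms, latency_ms, 0))  # low or negative
print(lagged_correlation(checkpoint_ms, latency_ms, 3))  # close to 1.0
```

A pair that correlates weakly at lag 0 but strongly at some positive lag is exactly what makes a metric a leading indicator: it buys you that many samples of warning.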

3. Seasonal Pattern Recognition

Weekday 9 AM spike is normal. Sunday 3 AM spike is not.

Learned daily patterns:
- 09:00: Login spike (80% above baseline) - EXPECTED
- 14:00: Analytics batch job - EXPECTED  
- 03:00 Sunday: Connection spike - UNEXPECTED

Anomaly score: 94/100
Investigation: Unauthorized API access detected
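Seasonality can be handled by learning a separate baseline per calendar slot. A toy version keyed on (weekday, hour), with invented connection counts:

```python
from collections import defaultdict
import statistics

class SeasonalBaseline:
    """Learn per-(weekday, hour) baselines; score new readings against them."""

    def __init__(self):
        self.samples = defaultdict(list)

    def observe(self, weekday, hour, value):
        self.samples[(weekday, hour)].append(value)

    def anomaly_score(self, weekday, hour, value):
        history = self.samples[(weekday, hour)]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history) or 1.0   # guard against zero spread
        return abs(value - mean) / stdev

b = SeasonalBaseline()
for week in range(8):                    # 8 weeks of history
    b.observe("Mon", 9, 180 + week)      # weekday 9 AM login spike
    b.observe("Sun", 3, 12 + week % 3)   # quiet Sunday nights

print(b.anomaly_score("Mon", 9, 185))    # small: a spike is expected in this slot
print(b.anomaly_score("Sun", 3, 140))    # large: the same load at 3 AM Sunday is anomalous
```

The same absolute connection count scores near zero on Monday morning and off the charts on Sunday night, which is the property a fixed threshold can never express.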

Case Study: Predicting the Unpredictable

A retail customer's database would mysteriously slow down every few weeks.

Traditional monitoring: Showed high CPU and connection errors only once the database was already slow.

PG Monitoring anomaly detection found:

Pattern detected 3 weeks before each incident:
- Dead tuple ratio grew 5% daily for 7 days
- Autovacuum couldn't keep up
- Table bloat crossed 40% threshold
- Query plans changed from index to seq scan
- Performance collapsed

Root cause: Long-running transaction preventing vacuum
(predictable, but threshold alerts never caught the pattern)

Fix: Added monitoring for long-running transactions; transactions open longer than 2 hours are now terminated automatically.
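The decision step behind that fix is simple. A minimal sketch (the 2-hour cutoff follows the fix above; in production the pid list would come from PostgreSQL's pg_stat_activity view, and each flagged pid would be passed to pg_terminate_backend):

```python
from datetime import datetime, timedelta

MAX_XACT_AGE = timedelta(hours=2)

# In production the input would come from a query such as:
#   SELECT pid, xact_start FROM pg_stat_activity
#   WHERE xact_start IS NOT NULL AND state <> 'idle';
def pids_to_terminate(transactions, now, max_age=MAX_XACT_AGE):
    """Given (pid, xact_start) pairs, return pids open longer than max_age."""
    return [pid for pid, started in transactions if now - started > max_age]

now = datetime(2026, 4, 3, 12, 0)
xacts = [
    (101, now - timedelta(minutes=30)),          # fine
    (102, now - timedelta(hours=3)),             # vacuum blocker - terminate
    (103, now - timedelta(hours=2, minutes=5)),  # just over the line
]
print(pids_to_terminate(xacts, now))   # [102, 103]
```

Each returned pid would then be handed to `pg_terminate_backend(pid)`, freeing the oldest transaction horizon so vacuum can reclaim dead tuples again.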

Feature Comparison

| Feature                                | Static Thresholds | Datadog    | TimescaleDB DIY | PG Monitoring |
|----------------------------------------|-------------------|------------|-----------------|---------------|
| Baseline learning                      | ✗                 | ✓          | Build yourself  | ✓             |
| Multi-metric correlation               | ✗                 | Partial    | Build yourself  | ✓             |
| Seasonal pattern awareness             | ✗                 | ✓          | Build yourself  | ✓             |
| PostgreSQL-specific models             | ✗                 | ✗          | ✗               | ✓             |
| Predictive alerts (leading indicators) | ✗                 | ✗          | ✗               | ✓             |
| On-premise / self-hosted               | ✓                 | ✗          | ✓               | ✓             |
| Cost per server                        | Free              | $70-150/mo | Infrastructure  | Included      |

The Bottom Line

Traditional alert: "CPU is 85% (threshold exceeded)"

Datadog alert: "CPU is 3σ above baseline for Tuesday 2 PM"

PG Monitoring alert: "Replication lag pattern suggests standby disk saturation in ~20 minutes based on current I/O trends and historical correlation. Recommend investigating disk queue depth."

Ready to experience better PostgreSQL monitoring?

Join thousands of teams who switched from traditional tools to PG Monitoring's AI-powered platform.

Talk to us