The Query Monitoring Gap
Most PostgreSQL monitoring tools fall into two categories:
- Log analyzers (pgBadger, pganalyze): Post-hoc analysis, hours of delay
- Stats viewers (pg_stat_statements): Current state only, no trends
Neither tells you what's happening right now or what's about to happen.
What's Missing from Traditional Tools
pg_stat_statements Limitations
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC;
(On PostgreSQL 12 and earlier, these columns are named total_time and mean_time.)
Problems:
- Normalized queries lose parameter values
- No real-time view (cumulative since reset)
- No query plan analysis
- No trend tracking (getting better or worse?)
- No correlation with system metrics
pgBadger: Powerful but Slow
pgBadger analyzes log files and generates excellent reports. But:
- Requires log rotation and file transfer
- Hours of delay between event and report
- Heavy disk I/O on production server
- Complex setup for continuous monitoring
pganalyze: Better, But Expensive
pganalyze adds visualization but:
- SaaS-only: Your data leaves your infrastructure
- $149+/month for basic features
- Limited real-time capabilities
PG Monitoring's Real-Time Query Intelligence
1. Continuous Query Sampling
A lightweight agent queries pg_stat_activity every second:
Capture: Query text, execution phase, wait events, rows/second
Overhead: <0.5% CPU on monitored instance
Storage: Rolling 30-day window with compression
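The sampling loop above can be sketched in a few lines. This is a hypothetical design sketch, not PG Monitoring's actual agent: `fetch_activity` stands in for a query such as `SELECT pid, query, state, wait_event FROM pg_stat_activity WHERE state <> 'idle'`, and the rolling window is modeled with a bounded deque.

```python
import time
from collections import deque

class ActivitySampler:
    """Sketch of a pg_stat_activity polling agent (assumed design).

    fetch_activity: callable returning a list of row dicts, standing in
    for a real query against pg_stat_activity.
    window_size: number of samples to retain; older samples are evicted
    automatically (a stand-in for the rolling retention window).
    """

    def __init__(self, fetch_activity, interval_s=1.0, window_size=86_400):
        self.fetch_activity = fetch_activity
        self.interval_s = interval_s
        self.samples = deque(maxlen=window_size)  # rolling window

    def sample_once(self, now=None):
        # Capture one snapshot of current activity with a timestamp.
        rows = self.fetch_activity()
        self.samples.append({"ts": now if now is not None else time.time(),
                             "rows": rows})
        return rows

    def run(self):
        # Poll once per interval; a real agent would also handle
        # reconnects and backoff.
        while True:
            self.sample_once()
            time.sleep(self.interval_s)
```

Keeping the window bounded means memory use is fixed regardless of uptime; compression and 30-day retention would live in the storage layer, not the sampler.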
2. Query Fingerprinting & Normalization
Groups similar queries while preserving what's important:
Query: SELECT * FROM orders WHERE user_id = 12345 AND status = 'pending'
Fingerprint: SELECT * FROM orders WHERE user_id = ? AND status = ?
Tracked separately:
- user_id = 12345 (specific customer)
- Execution plan changes
- Duration percentiles (p50, p95, p99)
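The fingerprinting idea can be illustrated with a simplified normalizer. Real normalizers (pg_stat_statements included) work on the parse tree rather than regexes, so treat this as a rough sketch of the concept:

```python
import re

def fingerprint(query: str) -> str:
    """Collapse literals to '?' so similar queries group together.

    A regex-based approximation of query normalization; production
    tools normalize from the parse tree instead.
    """
    q = re.sub(r"'(?:[^']|'')*'", "?", query)   # quoted string literals
    q = re.sub(r"\b\d+(?:\.\d+)?\b", "?", q)    # numeric literals
    return re.sub(r"\s+", " ", q).strip()       # collapse whitespace

q = "SELECT * FROM orders WHERE user_id = 12345 AND status = 'pending'"
print(fingerprint(q))
# SELECT * FROM orders WHERE user_id = ? AND status = ?
```

The point of tracking parameter values separately from the fingerprint is that the fingerprint groups the workload while the raw values explain outliers (e.g. one customer's user_id hitting a skewed partition).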
3. Automatic Regression Detection
Know when query performance degrades:
Query: orders_by_user_status
History: p95 duration 45ms for 30 days
Today: p95 duration 180ms (300% increase)
Alert triggered with:
- Execution plan change detected
- Sequential scan started (was index-only)
- Correlation: Table bloat increased 40% simultaneously
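A regression check of this kind reduces to comparing a current percentile against a historical baseline. A minimal sketch, assuming a simple "p95 exceeds N times baseline" alerting rule (the threshold and function names are illustrative, not PG Monitoring's actual rule):

```python
from statistics import quantiles

def p95(durations_ms):
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    return quantiles(durations_ms, n=20)[18]

def regressed(baseline_p95_ms, current_p95_ms, threshold=2.0):
    """Flag a regression when the current p95 exceeds the historical
    baseline by `threshold`x (an assumed alerting rule)."""
    return current_p95_ms >= threshold * baseline_p95_ms

# The example above: 45 ms baseline, 180 ms today = a 300% increase.
assert regressed(45, 180)       # 4x the baseline -> alert
assert not regressed(45, 60)    # within normal variation
```

A fixed multiplier is the simplest rule; a production detector would also account for variance in the baseline so that naturally noisy queries do not page anyone.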
4. Wait Event Correlation
Query slow? PG Monitoring shows exactly why:
- IO:DataFileRead - Reading data from disk (check cache hit ratio)
- Lock:tuple - Row-level contention (transaction design issue)
- Client:ClientRead - Application slow to consume results
- Lock:relation - Blocked on a table lock (e.g. an aggressive autovacuum or DDL)
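Per-second samples double as a wait-event profile: counting which wait event each sample caught approximates where a query's time went. A minimal sketch of that aggregation (the sample data and function name are assumptions for illustration):

```python
from collections import Counter

def wait_breakdown(samples):
    """Turn per-second wait-event samples into a time-share profile.

    Each element is the wait_event observed for one sample (None when
    the backend was actively executing); at a 1-second sampling
    interval, each occurrence approximates one second in that state.
    """
    counts = Counter(s for s in samples if s is not None)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.most_common()}

# Hypothetical samples for one slow query: mostly waiting on the client.
samples = ["ClientRead"] * 28 + [None] * 2 + ["Lock:tuple"]
profile = wait_breakdown(samples)
```

Even this crude tally distinguishes "the database is slow" from "the database is waiting on something else", which is exactly the distinction the case study below turns on.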
Case Study: Finding the Real Culprit
A dashboard was loading slowly (3+ seconds). Initial suspicion: missing index.
PG Monitoring analysis:
Query: SELECT ... FROM complex_report_view
Duration: 2,847ms
BUT wait events showed:
- 2,800ms: ClientRead (waiting for app to fetch)
- 47ms: actual query execution
Root cause: Application using an unbuffered cursor, fetching 50,000 rows one at a time.
An index would have wasted disk space. The fix was application-side cursor buffering.
Feature Comparison
| Feature | pg_stat_statements | pgBadger | pganalyze | PG Monitoring |
|---|---|---|---|---|
| Real-time query view | Current only | No | Limited | ✓ |
| Historical trends | Manual reset | From logs | ✓ | ✓ |
| Wait event breakdown | ✗ | ✗ | ✓ | ✓ |
| Automatic regression alerts | ✗ | ✗ | ✗ | ✓ |
| Query plan tracking | ✗ | ✗ | ✗ | ✓ |
| Correlation with metrics | ✗ | ✗ | Partial | ✓ |
| On-premise deployment | ✓ | ✓ | ✗ | ✓ |
The Bottom Line
Traditional tools show you "Query X took 2 seconds."
PG Monitoring shows: "Query X took 2 seconds because of ClientRead wait - the application is slow to consume results. Query execution was only 45ms. An index wouldn't help. Fix application cursor buffering."