## The Connection Problem Nobody Talks About
"Too many connections" is a symptom. The real questions are:
- Which queries hold connections longest?
- Are connections idle, active, or waiting?
- Is the pool too small, or are queries too slow?
- What's the optimal pool size for your workload?
Most monitoring tools just show connection count. That's useless for diagnosis.
## What Traditional Tools Show You
### pg_stat_activity

```sql
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
```

```
 count
-------
    87
```
Okay, 87 active connections. Now what? Which ones matter? Which are problematic?
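You can get a little further with plain SQL. A sketch of a more diagnostic starting point (assuming you can query `pg_stat_activity` directly, which requires superuser or `pg_monitor` membership to see other users' sessions):

```sql
-- Break connections down by state and show how long the oldest
-- session has been stuck in each state.
SELECT state,
       count(*) AS connections,
       max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()   -- exclude this monitoring session
GROUP BY state
ORDER BY connections DESC;
```

That at least separates active from idle and `idle in transaction`, but it's still a point-in-time snapshot: you see states, not trends, wait times, or which queries are responsible.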
### CloudWatch / Datadog RDS Metrics

- Show: `DatabaseConnections` metric over time
- Alert when: connections > 80% of max
Problem: By the time you alert, users are already getting "too many connections" errors. You're reactive.
### pgBouncer Statistics
If you're using pgBouncer, you get pool stats:
```sql
SHOW POOLS;
```

```
 database | cl_active | cl_waiting | sv_active | sv_idle
----------+-----------+------------+-----------+---------
 mydb     |        45 |         12 |        20 |       5
```
Better, but still limited:
- No correlation with query performance
- No wait time analysis
- No optimal sizing recommendations
## PG Monitoring's Connection Intelligence

### 1. Connection State Breakdown
```
Total connections: 94/100 (94%)

Breakdown:
├─ active: 67 (71%)
│  ├─ idle in transaction: 12 (problematic)
│  ├─ long running (>30s): 8 (investigate)
│  └─ normal queries: 47
├─ idle: 23 (24%)
│  ├─ idle < 1min: 18 (normal)
│  └─ idle > 5min: 5 (leak?)
└─ waiting: 4 (4% - pool bottleneck)
```
Wait time analysis:
- Average wait: 230ms
- p95 wait: 1.2s
- Max wait: 4.7s (connection storm at 14:23)
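A rough version of this classification can be built from `pg_stat_activity` itself. A sketch (the thresholds and labels mirror the breakdown above and are tunable assumptions, not fixed rules):

```sql
-- Classify each backend the way the breakdown above does.
SELECT pid, usename, application_name,
       CASE
         WHEN state = 'idle in transaction'
              AND now() - state_change > interval '1 minute'
           THEN 'idle in txn (problematic)'
         WHEN state = 'active'
              AND now() - query_start > interval '30 seconds'
           THEN 'long running (investigate)'
         WHEN state = 'idle'
              AND now() - state_change > interval '5 minutes'
           THEN 'idle > 5min (leak?)'
         ELSE state
       END AS classification,
       now() - query_start AS query_age
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
ORDER BY query_age DESC NULLS LAST;
```

Wait-time percentiles, by contrast, can't be derived from a single snapshot; they require continuous sampling, which is what the monitoring agent does.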
### 2. Query-to-Connection Correlation
Find which queries consume the most connection-seconds:
```
Top connection consumers (last hour):
1. reports/monthly_summary: 23 connections × 45s avg = 1,035 conn-seconds
2. analytics/user_cohort:   15 connections × 32s avg =   480 conn-seconds
3. api/order_history:       67 connections × 0.8s avg =   54 conn-seconds

Optimization opportunity:
Move reports to read replica or background job.
Would free up 40% of connection capacity.
```
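You can approximate connection-seconds per query with the `pg_stat_statements` extension, since total execution time is a reasonable proxy for how long server connections were occupied. A sketch, assuming the extension is installed (`CREATE EXTENSION pg_stat_statements;`) and PostgreSQL 13+, where the column is `total_exec_time` (on 12 and earlier it's `total_time`):

```sql
-- Which statements consumed the most cumulative execution time?
SELECT left(query, 60) AS query_head,
       calls,
       round((total_exec_time / 1000)::numeric, 1) AS conn_seconds,
       round((mean_exec_time / 1000)::numeric, 2)  AS avg_seconds
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Note these counters are cumulative since the last reset, not per-hour; attributing usage to a time window again requires periodic sampling.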
### 3. Pool Size Recommendations

Dynamic calculation based on actual usage patterns:

```
Current: max_connections = 100

Analysis:
- Peak active: 89 (89% utilization)
- Average active: 34
- Wait queue peaks: 3x daily at 09:00, 14:00, 17:00
- Wait time during peaks: 2.1s average

Recommendation:
- Immediate: Increase to 150 (adds headroom)
- Better: Implement connection pooling (pgBouncer)
  Suggested: pool_size = 40, max_client_conn = 200
  Expected: 60% connection reduction, wait time → 50ms
```
Cost comparison:
- RDS next tier: +$180/month
- pgBouncer on existing instance: $0
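For reference, the recommended settings translate into a small pgBouncer config. A minimal sketch, with placeholder host and database names for your environment:

```ini
[databases]
; Placeholder connection string; point at your primary.
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; transaction pooling returns server connections between transactions,
; which is what makes 40 server conns serve 200 clients
pool_mode = transaction
default_pool_size = 40
max_client_conn = 200
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
```

Transaction pooling is what delivers the reduction, but note it breaks session-level features (prepared statements on older pgBouncer versions, `SET`-based session state, advisory locks), so verify your application is compatible first.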
### 4. Connection Leak Detection
```
Idle connection analysis:

5 connections idle > 5 minutes:
- App server: web-03
- First seen: 14:23:17 (same time as deployment)
- Query last run: BEGIN; (never committed!)

Diagnosis: Application deployment didn't close transaction
Action: Terminate connections, fix app rollback logic
```
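The manual equivalent: list suspect sessions first, then terminate the confirmed leaks with `pg_terminate_backend`. A sketch (the 5-minute threshold is an assumption to match the example above):

```sql
-- Step 1: review sessions stuck in an open transaction.
SELECT pid, usename, client_addr,
       now() - state_change AS idle_for,
       left(query, 40) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '5 minutes';

-- Step 2: once confirmed, terminate them.
-- SELECT pg_terminate_backend(pid)
-- FROM pg_stat_activity
-- WHERE state = 'idle in transaction'
--   AND now() - state_change > interval '5 minutes';
```

As a guardrail against recurrence, PostgreSQL's `idle_in_transaction_session_timeout` setting will kill such sessions automatically after a configured interval.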
## Case Study: The Connection Storm
A SaaS app kept hitting "too many connections" errors during peak hours.

- Initial response: keep increasing `max_connections` (100 → 150 → 200)
- Problem: PostgreSQL performance degraded as connections grew, since each connection is a separate backend process and the context-switching and memory overhead add up
PG Monitoring analysis revealed:

```
Root cause: 3 analytics queries running synchronously
- Each took 30-45 seconds
- During peak: 20+ concurrent analytics requests
- Connection pile-up: 60+ waiting
- Transaction locks held for 45s → other queries blocked
```

The real problem wasn't connection count but connection duration.

Solution: move analytics to a read replica with a dedicated pool.
Result:
- Primary connections: 200 → 45
- Wait times: 2.1s → 50ms
- Query performance: +40% (no lock contention)
## Feature Comparison
| Feature | pg_stat_activity | CloudWatch | pgBouncer | PG Monitoring |
|---|---|---|---|---|
| Connection count | ✓ | ✓ | ✓ | ✓ |
| State breakdown (active/idle/wait) | ✓ | ✗ | ✓ | ✓ |
| Wait time tracking | ✗ | ✗ | Partial | ✓ |
| Query correlation | Manual | ✗ | ✗ | ✓ |
| Pool sizing recommendations | ✗ | ✗ | ✗ | ✓ |
| Leak detection | ✗ | ✗ | ✗ | ✓ |
| Cost optimization | ✗ | ✗ | ✗ | ✓ |
## The Bottom Line
Traditional tools: "You have 94 connections."
PG Monitoring: "94 connections with 4 waiting (230ms avg). Top consumer is reports/monthly_summary using 23 connections for 45 seconds each. 40% of your capacity consumed by 3 queries. Move to read replica to eliminate waits and save $180/month on RDS tier upgrade."