The Multi-Instance Monitoring Nightmare
If you manage multiple PostgreSQL instances, you know the pain:
- Different monitoring tools per environment
- No correlation between related instances
- Context switching between dashboards
- Can't see organization-wide trends
Traditional Approaches
Per-Instance Monitoring (pgAdmin, DBeaver)
Connect to each instance individually:
- Check primary: Everything looks fine
- Check standby 1: Lag 5 seconds
- Check standby 2: Lag 45 seconds
- Manual correlation: "Oh, they're related!"
This approach wastes time and misses cross-instance patterns.
SaaS Monitoring (Datadog, New Relic)
Centralized dashboards, but:
- Expensive per-server pricing ($70-150 per instance per month)
- Your data leaves your infrastructure
- Generic database monitoring (not PostgreSQL-specific)
- Limited cross-instance analysis
Self-Hosted Prometheus + Grafana
Build your own solution:
- Requires significant setup and maintenance
- PostgreSQL exporters don't capture everything
- No built-in replication correlation
- Alerting rules are complex to maintain
PG Monitoring's Unified Multi-Tenant Dashboard
1. Organization-Wide Health View
Instance Health Overview (12 instances)
Production (4):
├─ prod-primary-01: ✅ Healthy (97/100)
├─ prod-replica-01: ⚠️ Warning (82/100) - Lag 12s
├─ prod-replica-02: ✅ Healthy (94/100) - Lag 0.2s
└─ prod-analytics: ✅ Healthy (91/100)
Staging (4):
├─ staging-primary: ✅ Healthy (89/100)
└─ staging-replicas (3): ✅ All healthy
Development (4):
├─ dev-primary: ⚠️ Warning (74/100) - Bloat 45%
└─ dev-replicas (3): ✅ All healthy
Correlated Alert: prod-replica-01 lag spike correlates
with dev-primary vacuum activity (shared infrastructure)
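As a hedged sketch, a composite score like the 97/100 above could be built from weighted penalties on common problem signals. The weights and thresholds below are illustrative assumptions, not PG Monitoring's actual formula:

```python
# Illustrative health score: start at 100 and subtract capped penalties
# for lag, bloat, cache misses, and disk pressure.
# All weights and thresholds are assumptions for illustration only.

def health_score(replication_lag_s: float, bloat_pct: float,
                 cache_hit_pct: float, disk_used_pct: float) -> int:
    """Return a 0-100 score; 100 means no penalties applied."""
    score = 100.0
    # Penalize replication lag beyond a 1 s grace window, up to 25 points.
    score -= min(25.0, max(0.0, replication_lag_s - 1.0) * 2.0)
    # Penalize table bloat above 20%, up to 25 points.
    score -= min(25.0, max(0.0, bloat_pct - 20.0))
    # Penalize cache hit ratio below 95%, up to 25 points.
    score -= min(25.0, max(0.0, 95.0 - cache_hit_pct) * 2.0)
    # Penalize disk usage above 70%, up to 25 points.
    score -= min(25.0, max(0.0, disk_used_pct - 70.0) * 1.5)
    return max(0, round(score))

print(health_score(0.2, 10, 99, 40))   # healthy instance → 100
print(health_score(12, 10, 99, 40))    # lagging replica → 78
```

Capping each penalty keeps a single bad signal from zeroing the score, so one unhealthy dimension still leaves room to distinguish "warning" from "critical".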
2. Cross-Instance Replication Topology
Replication Topology View:
prod-primary-01 (10.0.1.10)
├──► prod-replica-01 (10.0.1.11) [Sync] Lag: 12s ⚠️
├──► prod-replica-02 (10.0.2.11) [Async] Lag: 0.2s ✅
└──► prod-analytics (10.0.3.15) [Logical] Lag: 45s ⚠️
Issues:
- prod-replica-01: Write lag normal, replay lag high
→ Disk I/O bottleneck on standby
- prod-analytics: Slot bloat detected
→ Inactive replication slot retaining 2.3GB WAL
→ Risk: Disk fill within 18 hours
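An estimate like "disk fill within 18 hours" can come from a simple linear extrapolation of the slot's observed WAL retention rate. A minimal sketch, with hypothetical free-space and growth numbers:

```python
# Estimate hours until disk fill caused by an inactive replication slot
# retaining WAL. Input values below are illustrative assumptions.

def hours_until_disk_full(free_bytes: int, wal_retained_bytes: int,
                          retained_age_hours: float) -> float:
    """Extrapolate linearly from the slot's observed retention rate."""
    bytes_per_hour = wal_retained_bytes / retained_age_hours
    return free_bytes / bytes_per_hour

GIB = 1024 ** 3
# Hypothetical: 2.3 GiB of WAL retained over the last 6 hours,
# 6.9 GiB of free disk remaining on the primary.
hours = hours_until_disk_full(int(6.9 * GIB), int(2.3 * GIB), 6.0)
print(round(hours))  # → 18
```

A real implementation would read retained WAL from `pg_replication_slots` and smooth the rate over a longer window; a single two-point extrapolation is noisy.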
3. Organization-Level Insights
Monthly Organization Report:
Query Performance:
- Instances with regression: 2/12
- Slowest instance: prod-analytics (p95: 2.3s)
- Fastest improvement: dev-primary (after index recommendations)
Resource Utilization:
- Highest CPU: prod-primary-01 (72% avg)
- Lowest cache hit: dev-primary (78% - needs attention)
- Disk pressure: 3 instances > 70% capacity
Security Posture:
- SSL enforced: 11/12 instances ✅
- Superuser access restricted: 10/12 ⚠️
- Password policies: 8/12 compliant ⚠️
Cost Optimization:
- 3 instances oversized (CPU < 30%)
- 1 instance undersized (CPU > 85% sustained)
- Estimated savings: $420/month with rightsizing
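The oversized/undersized split above follows directly from the CPU thresholds in the report. A sketch of that classification, using hypothetical fleet data:

```python
# Classify instances for rightsizing from sustained average CPU.
# Thresholds mirror the report (< 30% oversized, > 85% undersized);
# the fleet data is hypothetical.

def rightsizing(avg_cpu_by_instance: dict) -> dict:
    result = {}
    for name, cpu in avg_cpu_by_instance.items():
        if cpu < 30:
            result[name] = "oversized"
        elif cpu > 85:
            result[name] = "undersized"
        else:
            result[name] = "ok"
    return result

fleet = {"prod-primary-01": 72, "dev-primary": 22, "staging-primary": 91}
print(rightsizing(fleet))
# → {'prod-primary-01': 'ok', 'dev-primary': 'oversized',
#    'staging-primary': 'undersized'}
```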
4. Team-Based Access Control
Organization: Acme Corp
Teams:
├─ Platform Engineering
│  ├─ Instances: prod-*, staging-*
│  └─ Permissions: Full admin
│
├─ Application Team A
│  ├─ Instances: prod-primary-01, prod-replica-01
│  └─ Permissions: Read-only queries, no config changes
│
├─ Analytics Team
│  ├─ Instances: prod-analytics
│  └─ Permissions: Query analysis, index recommendations
│
└─ QA Team
   ├─ Instances: staging-*, dev-*
   └─ Permissions: Full access (non-production)
Audit: All access logged with team attribution
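The `prod-*` style grants above suggest glob-pattern matching of instance names per team. A minimal sketch of such an access check, assuming these (hypothetical) team definitions:

```python
# Team-based instance access check using glob-style patterns.
# Team grants below are illustrative, mirroring the tree above.
from fnmatch import fnmatch

TEAM_GRANTS = {
    "Platform Engineering": (["prod-*", "staging-*"], "admin"),
    "Application Team A": (["prod-primary-01", "prod-replica-01"], "read"),
    "QA Team": (["staging-*", "dev-*"], "admin"),
}

def can_access(team: str, instance: str) -> bool:
    """True if any of the team's patterns matches the instance name."""
    patterns, _level = TEAM_GRANTS.get(team, ([], None))
    return any(fnmatch(instance, p) for p in patterns)

print(can_access("QA Team", "dev-replica-02"))          # → True
print(can_access("Application Team A", "dev-primary"))  # → False
```

In production this check would run server-side on every request, with the decision written to the audit log alongside the team attribution.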
Feature Comparison
| Feature | pgAdmin | Datadog | Prometheus/Grafana | PG Monitoring |
|---|---|---|---|---|
| Multi-instance dashboard | ✗ | ✓ | Build yourself | ✓ |
| Replication topology view | ✗ | Partial | Build yourself | ✓ |
| Cross-instance correlation | ✗ | ✗ | ✗ | ✓ |
| Organization-wide reporting | ✗ | Partial | ✗ | ✓ |
| Team-based access control | ✗ | ✓ | ✗ | ✓ |
| Cost optimization insights | ✗ | ✗ | ✗ | ✓ |
| On-premise / self-hosted | ✓ | ✗ | ✓ | ✓ |
| Cost per instance | Free | $70-150/mo | Infrastructure | Flat rate |
Case Study: The Cross-Instance Cascade Failure
A company had 8 PostgreSQL instances across dev, staging, and prod. One Friday:
Timeline of cascade failure:
14:23 - dev-primary vacuum started (weekly maintenance)
14:25 - dev-replica lag increasing (normal during vacuum)
14:30 - staging-primary I/O spike detected
14:35 - prod-replica-01 lag spike (12s → 45s)
14:40 - Support ticket: "Production is slow!"
Root cause (discovered after 2 hours):
- All instances share same SAN storage
- dev-primary vacuum saturated disk I/O
- Saturation cascaded to staging, then to production replicas
- No single tool showed the correlation
With PG Monitoring: Cross-instance correlation would have identified the shared infrastructure dependency immediately. Alert: "I/O saturation on dev-primary correlates with lag spikes on staging and production instances. Shared SAN detected."
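The core of that correlation is straightforward: flag events on other instances that land within a short window after an I/O saturation event. A sketch over the timeline above (timestamps and the window size are illustrative assumptions):

```python
# Time-window correlation across instances: find events on *other*
# instances that occur shortly after an I/O saturation anchor event.
# Event data mirrors the case-study timeline; the date is arbitrary.
from datetime import datetime, timedelta

events = [
    ("dev-primary", "io_saturation", datetime(2024, 1, 5, 14, 23)),
    ("dev-replica", "lag_spike", datetime(2024, 1, 5, 14, 25)),
    ("staging-primary", "io_spike", datetime(2024, 1, 5, 14, 30)),
    ("prod-replica-01", "lag_spike", datetime(2024, 1, 5, 14, 35)),
]

def correlated(events, anchor_kind="io_saturation",
               window=timedelta(minutes=15)):
    """Return (instance, kind) pairs within `window` after an anchor."""
    anchors = [e for e in events if e[1] == anchor_kind]
    hits = []
    for inst, kind, ts in events:
        for anchor_inst, _, anchor_ts in anchors:
            if inst != anchor_inst and anchor_ts <= ts <= anchor_ts + window:
                hits.append((inst, kind))
    return hits

print(correlated(events))
# → [('dev-replica', 'lag_spike'), ('staging-primary', 'io_spike'),
#    ('prod-replica-01', 'lag_spike')]
```

With this view, the 14:35 production lag spike is surfaced as a consequence of the 14:23 dev vacuum rather than an independent incident, which is exactly the link the on-call engineer spent two hours finding by hand.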
The Bottom Line
Traditional tools: "Instance X has a problem." "Instance Y has a problem." (No connection between them)
PG Monitoring: "I/O saturation on dev-primary (non-critical) is causing replica lag on staging and production instances due to shared SAN infrastructure. Recommend: reschedule dev maintenance or separate storage. All 3 instances affected - org-wide impact."