The Multi-Instance Monitoring Nightmare
If you manage multiple PostgreSQL instances, you know the pain:
- Different monitoring tools per environment
- No correlation between related instances
- Context switching between dashboards
- Can't see organization-wide trends
Traditional Approaches
Per-Instance Monitoring (pgAdmin, DBeaver)
Connect to each instance individually:
- Check primary: Everything looks fine
- Check standby 1: Lag 5 seconds
- Check standby 2: Lag 45 seconds
- Manual correlation: "Oh, they're related!"
This approach wastes time and misses cross-instance patterns.
SaaS Monitoring (Datadog, New Relic)
Centralized dashboards, but:
- Expensive per-server pricing ($70-150 per instance per month)
- Your data leaves your infrastructure
- Generic database monitoring (not PostgreSQL-specific)
- Limited cross-instance analysis
Self-Hosted Prometheus + Grafana
Build your own solution:
- Requires significant setup and maintenance
- PostgreSQL exporters don't capture everything
- No built-in replication correlation
- Alerting rules are complex to maintain
PG Monitoring's Unified Multi-Tenant Dashboard
1. Organization-Wide Health View
Instance Health Overview (12 instances)
Production (4):
├─ prod-primary-01: ✅ Healthy (97/100)
├─ prod-replica-01: ⚠️ Warning (82/100) - Lag 12s
├─ prod-replica-02: ✅ Healthy (94/100) - Lag 0.2s
└─ prod-analytics: ✅ Healthy (91/100)
Staging (4):
├─ staging-primary: ✅ Healthy (89/100)
└─ staging-replicas (3): ✅ All healthy
Development (4):
├─ dev-primary: ⚠️ Warning (74/100) - Bloat 45%
└─ dev-replicas (3): ✅ All healthy
Correlated Alert: prod-replica-01 lag spike correlates
with dev-primary vacuum activity (shared infrastructure)
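As a hedged sketch, a composite score like the 97/100 above could be built from weighted penalties on common problem signals. The weights and thresholds below are illustrative assumptions, not PG Monitoring's actual formula:

```python
# Illustrative health score: start at 100 and subtract capped penalties
# for lag, bloat, cache misses, and disk pressure.
# All weights and thresholds are assumptions for illustration only.

def health_score(replication_lag_s: float, bloat_pct: float,
                 cache_hit_pct: float, disk_used_pct: float) -> int:
    """Return a 0-100 score; 100 means no penalties applied."""
    score = 100.0
    # Penalize replication lag beyond a 1 s grace window, up to 25 points.
    score -= min(25.0, max(0.0, replication_lag_s - 1.0) * 2.0)
    # Penalize table bloat above 20%, up to 25 points.
    score -= min(25.0, max(0.0, bloat_pct - 20.0))
    # Penalize cache hit ratio below 95%, up to 25 points.
    score -= min(25.0, max(0.0, 95.0 - cache_hit_pct) * 2.0)
    # Penalize disk usage above 70%, up to 25 points.
    score -= min(25.0, max(0.0, disk_used_pct - 70.0) * 1.5)
    return max(0, round(score))

print(health_score(0.2, 10, 99, 40))   # healthy instance → 100
print(health_score(12, 10, 99, 40))    # lagging replica → 78
```

Capping each penalty keeps a single bad signal from zeroing the score, so one unhealthy dimension still leaves room to distinguish "warning" from "critical".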
2. Cross-Instance Replication Topology
Replication Topology View:
prod-primary-01 (10.0.1.10)
├──► prod-replica-01 (10.0.1.11) [Sync] Lag: 12s ⚠️
├──► prod-replica-02 (10.0.2.11) [Async] Lag: 0.2s ✅
└──► prod-analytics (10.0.3.15) [Logical] Lag: 45s ⚠️
Issues:
- prod-replica-01: Write lag normal, replay lag high
→ Disk I/O bottleneck on standby
- prod-analytics: Slot bloat detected
→ Inactive replication slot retaining 2.3GB WAL
→ Risk: Disk fill within 18 hours
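An estimate like "disk fill within 18 hours" can come from a simple linear extrapolation of the slot's observed WAL retention rate. A minimal sketch, with hypothetical free-space and growth numbers:

```python
# Estimate hours until disk fill caused by an inactive replication slot
# retaining WAL. Input values below are illustrative assumptions.

def hours_until_disk_full(free_bytes: int, wal_retained_bytes: int,
                          retained_age_hours: float) -> float:
    """Extrapolate linearly from the slot's observed retention rate."""
    bytes_per_hour = wal_retained_bytes / retained_age_hours
    return free_bytes / bytes_per_hour

GIB = 1024 ** 3
# Hypothetical: 2.3 GiB of WAL retained over the last 6 hours,
# 6.9 GiB of free disk remaining on the primary.
hours = hours_until_disk_full(int(6.9 * GIB), int(2.3 * GIB), 6.0)
print(round(hours))  # → 18
```

A real implementation would read retained WAL from `pg_replication_slots` and smooth the rate over a longer window; a single two-point extrapolation is noisy.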
3. Organization-Level Insights
Monthly Organization Report:
Query Performance:
- Instances with regression: 2/12
- Slowest instance: prod-analytics (p95: 2.3s)
- Fastest improvement: dev-primary (after index recommendations)
Resource Utilization:
- Highest CPU: prod-primary-01 (72% avg)
- Lowest cache hit: dev-primary (78% - needs attention)
- Disk pressure: 3 instances > 70% capacity
Security Posture:
- SSL enforced: 11/12 instances ✅
- Superuser access restricted: 10/12 ⚠️
- Password policies: 8/12 compliant ⚠️
Cost Optimization:
- 3 instances oversized (CPU < 30%)
- 1 instance undersized (CPU > 85% sustained)
- Estimated savings: $420/month with rightsizing
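The oversized/undersized split above follows directly from the CPU thresholds in the report. A sketch of that classification, using hypothetical fleet data:

```python
# Classify instances for rightsizing from sustained average CPU.
# Thresholds mirror the report (< 30% oversized, > 85% undersized);
# the fleet data is hypothetical.

def rightsizing(avg_cpu_by_instance: dict) -> dict:
    result = {}
    for name, cpu in avg_cpu_by_instance.items():
        if cpu < 30:
            result[name] = "oversized"
        elif cpu > 85:
            result[name] = "undersized"
        else:
            result[name] = "ok"
    return result

fleet = {"prod-primary-01": 72, "dev-primary": 22, "staging-primary": 91}
print(rightsizing(fleet))
# → {'prod-primary-01': 'ok', 'dev-primary': 'oversized',
#    'staging-primary': 'undersized'}
```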
4. Team-Based Access Control
Organization: Acme Corp
Teams:
├─ Platform Engineering
│  ├─ Instances: prod-*, staging-*
│  └─ Permissions: Full admin
│
├─ Application Team A
│  ├─ Instances: prod-primary-01, prod-replica-01
│  └─ Permissions: Read-only queries, no config changes
│
├─ Analytics Team
│  ├─ Instances: prod-analytics
│  └─ Permissions: Query analysis, index recommendations
│
└─ QA Team
   ├─ Instances: staging-*, dev-*
   └─ Permissions: Full access (non-production)
Audit: All access logged with team attribution
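The `prod-*` style grants above suggest glob-pattern matching of instance names per team. A minimal sketch of such an access check, assuming these (hypothetical) team definitions:

```python
# Team-based instance access check using glob-style patterns.
# Team grants below are illustrative, mirroring the tree above.
from fnmatch import fnmatch

TEAM_GRANTS = {
    "Platform Engineering": (["prod-*", "staging-*"], "admin"),
    "Application Team A": (["prod-primary-01", "prod-replica-01"], "read"),
    "QA Team": (["staging-*", "dev-*"], "admin"),
}

def can_access(team: str, instance: str) -> bool:
    """True if any of the team's patterns matches the instance name."""
    patterns, _level = TEAM_GRANTS.get(team, ([], None))
    return any(fnmatch(instance, p) for p in patterns)

print(can_access("QA Team", "dev-replica-02"))          # → True
print(can_access("Application Team A", "dev-primary"))  # → False
```

In production this check would run server-side on every request, with the decision written to the audit log alongside the team attribution.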
Feature Comparison
| Feature | pgAdmin | Datadog | Prometheus/Grafana | PG Monitoring |
|---|---|---|---|---|
| Multi-instance dashboard | ✗ | ✓ | Build yourself | ✓ |
| Replication topology view | ✗ | Partial | Build yourself | ✓ |
| Cross-instance correlation | ✗ | ✗ | ✗ | ✓ |
| Organization-wide reporting | ✗ | Partial | ✗ | ✓ |
| Team-based access control | ✗ | ✓ | ✗ | ✓ |
| Cost optimization insights | ✗ | ✗ | ✗ | ✓ |
| On-premise / self-hosted | ✓ | ✗ | ✓ | ✓ |
| Cost per instance | Free | $70-150/mo | Infrastructure | Flat rate |
Case Study: The Cross-Instance Cascade Failure
A company had 8 PostgreSQL instances across dev, staging, and prod. One Friday:
Timeline of cascade failure:
14:23 - dev-primary vacuum started (weekly maintenance)
14:25 - dev-replica lag increasing (normal during vacuum)
14:30 - staging-primary I/O spike detected
14:35 - prod-replica-01 lag spike (12s → 45s)
14:40 - Support ticket: "Production is slow!"
Root cause (discovered after 2 hours):
- All instances share same SAN storage
- dev-primary vacuum saturated disk I/O
- Saturation cascaded to staging, then to production replicas
- No single tool showed the correlation
With PG Monitoring: Cross-instance correlation would have identified the shared infrastructure dependency immediately. Alert: "I/O saturation on dev-primary correlates with lag spikes on staging and production instances. Shared SAN detected."
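The core of that correlation is straightforward: flag events on other instances that land within a short window after an I/O saturation event. A sketch over the timeline above (timestamps and the window size are illustrative assumptions):

```python
# Time-window correlation across instances: find events on *other*
# instances that occur shortly after an I/O saturation anchor event.
# Event data mirrors the case-study timeline; the date is arbitrary.
from datetime import datetime, timedelta

events = [
    ("dev-primary", "io_saturation", datetime(2024, 1, 5, 14, 23)),
    ("dev-replica", "lag_spike", datetime(2024, 1, 5, 14, 25)),
    ("staging-primary", "io_spike", datetime(2024, 1, 5, 14, 30)),
    ("prod-replica-01", "lag_spike", datetime(2024, 1, 5, 14, 35)),
]

def correlated(events, anchor_kind="io_saturation",
               window=timedelta(minutes=15)):
    """Return (instance, kind) pairs within `window` after an anchor."""
    anchors = [e for e in events if e[1] == anchor_kind]
    hits = []
    for inst, kind, ts in events:
        for anchor_inst, _, anchor_ts in anchors:
            if inst != anchor_inst and anchor_ts <= ts <= anchor_ts + window:
                hits.append((inst, kind))
    return hits

print(correlated(events))
# → [('dev-replica', 'lag_spike'), ('staging-primary', 'io_spike'),
#    ('prod-replica-01', 'lag_spike')]
```

With this view, the 14:35 production lag spike is surfaced as a consequence of the 14:23 dev vacuum rather than an independent incident, which is exactly the link the on-call engineer spent two hours finding by hand.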
The Bottom Line
Traditional tools: "Instance X has a problem." "Instance Y has a problem." (No connection between them)
PG Monitoring: "I/O saturation on dev-primary (non-critical) is causing replica lag on staging and production instances due to shared SAN infrastructure. Recommend: reschedule dev maintenance or separate storage. All 3 instances affected - org-wide impact."