Multi-Tenant Monitoring: One Dashboard, All Instances

PG Monitoring Team · March 20, 2026 · 6 min read

The Multi-Instance Monitoring Nightmare

If you manage multiple PostgreSQL instances, you know the pain:

  • Different monitoring tools per environment
  • No correlation between related instances
  • Context switching between dashboards
  • Can't see organization-wide trends

Traditional Approaches

Per-Instance Monitoring (pgAdmin, DBeaver)

Connect to each instance individually:

  • Check primary: Everything looks fine
  • Check standby 1: Lag 5 seconds
  • Check standby 2: Lag 45 seconds
  • Manual correlation: "Oh, they're related!"

Wastes time and misses patterns.

SaaS Monitoring (Datadog, New Relic)

Centralized dashboards, but:

  • Expensive per-server pricing ($70-150 per instance per month)
  • Your data leaves your infrastructure
  • Generic database monitoring (not PostgreSQL-specific)
  • Limited cross-instance analysis

Self-Hosted Prometheus + Grafana

Build your own solution:

  • Requires significant setup and maintenance
  • PostgreSQL exporters don't capture everything
  • No built-in replication correlation
  • Alerting rules are complex to maintain

PG Monitoring's Unified Multi-Tenant Dashboard

1. Organization-Wide Health View

Instance Health Overview (12 instances)

Production (4):
├─ prod-primary-01: ✅ Healthy (97/100)
├─ prod-replica-01: ⚠️ Warning (82/100) - Lag 12s
├─ prod-replica-02: ✅ Healthy (94/100) - Lag 0.2s
└─ prod-analytics: ✅ Healthy (91/100)

Staging (4):
├─ staging-primary: ✅ Healthy (89/100)
└─ staging-replicas (3): ✅ All healthy

Development (4):
├─ dev-primary: ⚠️ Warning (74/100) - Bloat 45%
└─ dev-replicas (3): ✅ All healthy

Correlated Alert: prod-replica-01 lag spike correlates
with dev-primary vacuum activity (shared infrastructure)
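
A composite score like "97/100" can be sketched as a weighted-penalty model: start at 100 and deduct points for each metric past its threshold. The metrics, thresholds, and weights below are illustrative assumptions, not PG Monitoring's actual scoring formula:

```python
def health_score(metrics: dict) -> int:
    """Start at 100 and subtract weighted penalties for each metric
    that exceeds its (hypothetical) threshold."""
    score = 100.0
    lag = metrics.get("replication_lag_s", 0.0)
    if lag > 1.0:                        # over 1 s of lag costs 1.5 pts/s
        score -= (lag - 1.0) * 1.5
    bloat = metrics.get("bloat_pct", 0.0)
    if bloat > 20.0:                     # over 20% bloat costs 0.5 pts per point
        score -= (bloat - 20.0) * 0.5
    cache = metrics.get("cache_hit_pct", 100.0)
    if cache < 90.0:                     # each point under 90% cache hit costs 1 pt
        score -= (90.0 - cache) * 1.0
    return max(0, round(score))

# A 12 s lag alone: 100 - (12 - 1) * 1.5 = 83.5, rounded to 84.
print(health_score({"replication_lag_s": 12}))
```

With different weights, the same structure reproduces scores in the range shown above; the point is that one number summarizes several thresholded signals.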

2. Cross-Instance Replication Topology

Replication Topology View:

prod-primary-01 (10.0.1.10)
├──► prod-replica-01 (10.0.1.11) [Sync] Lag: 12s ⚠️
├──► prod-replica-02 (10.0.2.11) [Async] Lag: 0.2s ✅
└──► prod-analytics (10.0.3.15) [Logical] Lag: 45s ⚠️

Issues:
- prod-replica-01: Write lag normal, replay lag high
  → Disk I/O bottleneck on standby
  
- prod-analytics: Slot bloat detected
  → Inactive replication slot retaining 2.3GB WAL
  → Risk: Disk fill within 18 hours
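
The "disk fill within 18 hours" estimate follows from a simple rate calculation: once an inactive slot pins WAL, free space shrinks at the WAL generation rate. (On a live instance, the retained WAL per slot can be measured from `pg_replication_slots` with `pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)`.) A minimal sketch with illustrative numbers:

```python
def hours_until_disk_full(free_gib: float, wal_gib_per_hour: float) -> float:
    """Estimate hours until the volume fills while an inactive
    replication slot prevents WAL segments from being recycled."""
    if wal_gib_per_hour <= 0:
        return float("inf")  # WAL is not accumulating; no fill risk
    return free_gib / wal_gib_per_hour

# Illustrative: 45 GiB free and ~2.5 GiB/h of WAL -> 18 hours to full.
print(hours_until_disk_full(45, 2.5))  # 18.0
```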

3. Organization-Level Insights

Monthly Organization Report:

Query Performance:
- Instances with regression: 2/12
- Slowest instance: prod-analytics (p95: 2.3s)
- Fastest improvement: dev-primary (after index recommendations)

Resource Utilization:
- Highest CPU: prod-primary-01 (72% avg)
- Lowest cache hit: dev-primary (78% - needs attention)
- Disk pressure: 3 instances > 70% capacity

Security Posture:
- SSL enforced: 11/12 instances ✅
- Superuser access restricted: 10/12 ⚠️
- Password policies: 8/12 compliant ⚠️

Cost Optimization:
- 3 instances oversized (CPU < 30%)
- 1 instance undersized (CPU > 85% sustained)
- Estimated savings: $420/month with rightsizing
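
The rightsizing findings above translate directly into a classification rule. A sketch using the report's cutoffs (under 30% average CPU means oversized, over 85% sustained CPU means undersized; how the cutoffs are applied is our assumption):

```python
def rightsizing(avg_cpu_pct: float, sustained_cpu_pct: float) -> str:
    """Classify an instance using the report's thresholds: sustained CPU
    above 85% -> undersized; average CPU below 30% -> oversized."""
    if sustained_cpu_pct > 85:
        return "undersized"
    if avg_cpu_pct < 30:
        return "oversized"
    return "right-sized"

fleet = {
    "prod-primary-01": (72, 80),   # (avg %, sustained %) - illustrative numbers
    "dev-replica-01":  (12, 25),
    "prod-analytics":  (78, 91),
}
for name, cpu in fleet.items():
    print(f"{name}: {rightsizing(*cpu)}")
```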

4. Team-Based Access Control

Organization: Acme Corp

Teams:
├─ Platform Engineering
│  ├─ Instances: prod-*, staging-*
│  └─ Permissions: Full admin
│
├─ Application Team A
│  ├─ Instances: prod-primary-01, prod-replica-01
│  └─ Permissions: Read-only queries, no config changes
│
├─ Analytics Team
│  ├─ Instances: prod-analytics
│  └─ Permissions: Query analysis, index recommendations
│
└─ QA Team
   ├─ Instances: staging-*, dev-*
   └─ Permissions: Full access (non-production)

Audit: All access logged with team attribution
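
Scoped access like this is commonly implemented with glob-style instance patterns per team. A minimal sketch mirroring the tree above (the pattern syntax and mapping are our assumptions, not PG Monitoring's actual ACL model):

```python
from fnmatch import fnmatch

# Hypothetical team -> instance-pattern mapping, mirroring the tree above.
TEAM_SCOPES = {
    "Platform Engineering": ["prod-*", "staging-*"],
    "Application Team A":   ["prod-primary-01", "prod-replica-01"],
    "Analytics Team":       ["prod-analytics"],
    "QA Team":              ["staging-*", "dev-*"],
}

def can_access(team: str, instance: str) -> bool:
    """A team may access an instance if any of its patterns matches."""
    return any(fnmatch(instance, pat) for pat in TEAM_SCOPES.get(team, []))

print(can_access("QA Team", "dev-replica-02"))              # True
print(can_access("Application Team A", "prod-replica-02"))  # False
```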

Feature Comparison

Feature                      | pgAdmin | Datadog    | Prometheus/Grafana | PG Monitoring
Multi-instance dashboard     | ❌      | ✅         | Build yourself     | ✅
Replication topology view    | ❌      | Partial    | Build yourself     | ✅
Cross-instance correlation   | ❌      | ❌         | ❌                 | ✅
Organization-wide reporting  | ❌      | Partial    | ❌                 | ✅
Team-based access control    | ❌      | ✅         | ❌                 | ✅
Cost optimization insights   | ❌      | ❌         | ❌                 | ✅
On-premise / self-hosted     | ✅      | ❌         | ✅                 | ✅
Cost per instance            | Free    | $70-150/mo | Infrastructure     | Flat rate

Case Study: The Cross-Instance Cascade Failure

A company had 8 PostgreSQL instances across dev, staging, and prod. One Friday:

Timeline of cascade failure:

14:23 - dev-primary vacuum started (weekly maintenance)
14:25 - dev-replica lag increasing (normal during vacuum)
14:30 - staging-primary I/O spike detected
14:35 - prod-replica-01 lag spike (12s → 45s)
14:40 - Support ticket: "Production is slow!"

Root cause (discovered after 2 hours):
- All instances shared the same SAN storage
- The dev-primary vacuum saturated disk I/O
- I/O pressure cascaded to the staging and then the production replicas
- No single tool showed the correlation

With PG Monitoring: Cross-instance correlation would have identified the shared infrastructure dependency immediately. Alert: "I/O saturation on dev-primary correlates with lag spikes on staging and production instances. Shared SAN detected."

The Bottom Line

Traditional tools: "Instance X has a problem." "Instance Y has a problem." (No connection between them)

PG Monitoring: "I/O saturation on dev-primary (non-critical) is causing replica lag on staging and production instances due to shared SAN infrastructure. Recommend: reschedule dev maintenance or separate storage. All 3 instances affected - org-wide impact."

Ready to experience better PostgreSQL monitoring?

Join thousands of teams who switched from traditional tools to PG Monitoring's AI-powered platform.

Talk to us