Yogita Sharma

Platform Engineer

AWS | Kubernetes | Terraform | CI/CD

Gurugram, India

Building a Centralized Observability Platform for Multi-Cluster Kubernetes

8 min read
Tags: Kubernetes, Prometheus, Grafana, Observability

The Challenge

When you’re managing multiple Kubernetes clusters in production, observability becomes a critical challenge. Each cluster generates its own metrics and logs, and without a unified view, troubleshooting turns into a game of jumping between dashboards and SSH sessions.

At Acefone, I was tasked with building a centralized observability platform that could aggregate data from all our clusters into a single pane of glass.

Architecture Overview

The solution I designed uses a hub-and-spoke model:

  • Spoke clusters: Each Kubernetes cluster runs Grafana Alloy as a DaemonSet, collecting metrics (via Prometheus remote write) and logs (via Loki’s push API)
  • Hub cluster: A dedicated monitoring cluster hosts the central Prometheus, Loki, and Grafana stack
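For the hub side of this model to work, the central Prometheus has to accept pushed samples. A minimal sketch of the relevant container spec fragment, assuming the spokes write directly to Prometheus (image tag and retention are illustrative; the `--web.enable-remote-write-receiver` flag, available since Prometheus v2.33, exposes the `/api/v1/write` endpoint; Loki's push API is enabled by default):

```yaml
# Hypothetical hub-cluster Prometheus container spec fragment
containers:
  - name: prometheus
    image: prom/prometheus:v2.53.0
    args:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d   # sized for multi-cluster ingest
      - --web.enable-remote-write-receiver  # accept pushes from spoke clusters
```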

Why Grafana Alloy?

Grafana Alloy (the successor to Grafana Agent) was chosen because:

  • Lightweight footprint compared to running full Prometheus instances on each cluster
  • Native support for Prometheus remote write and Loki push
  • Built-in service discovery for Kubernetes workloads
  • Pipeline-based configuration that’s easy to maintain as code

Key Implementation Details

Metrics Pipeline

```alloy
// Alloy configuration for metrics collection.
// Discover pods via the Kubernetes API so the scrape component has targets.
discovery.kubernetes "pods" {
  role = "pod"
}

prometheus.scrape "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.central.receiver]
}

prometheus.remote_write "central" {
  endpoint {
    url = "https://prometheus-central.internal/api/v1/write"
  }
}
```
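The log side of the same spoke pipeline follows the same shape. A sketch using Alloy's `loki.source.kubernetes` and `loki.write` components (the endpoint URL is illustrative):

```alloy
// Discover pods, tail their logs via the Kubernetes API,
// and push them to the hub cluster's Loki.
discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.central.receiver]
}

loki.write "central" {
  endpoint {
    url = "https://loki-central.internal/loki/api/v1/push"
  }
}
```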

Alerting with Teams Integration

I implemented production-grade alerting using Prometheus Alertmanager with Microsoft Teams webhook integration. Alerts are categorized by severity:

  • Critical: CPU > 90% for 5 minutes, Memory > 85% for 5 minutes
  • Warning: CPU > 75% for 10 minutes, pod restart count > 3 in 15 minutes
  • Info: Deployment rollout status changes
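As a sketch, the Critical CPU tier above could be expressed as a Prometheus alerting rule paired with an Alertmanager route to Teams. Metric names, thresholds wiring, and webhook URLs here are illustrative, and native `msteams_configs` support requires Alertmanager v0.26+:

```yaml
# rules.yml -- illustrative rule for the Critical severity tier
groups:
  - name: cluster-critical
    rules:
      - alert: HighCPUUsage
        expr: |
          100 * (1 - avg by (cluster) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU above 90% for 5 minutes on {{ $labels.cluster }}"

# alertmanager.yml -- route critical alerts to a dedicated Teams channel
route:
  receiver: teams-default
  routes:
    - matchers:
        - severity="critical"
      receiver: teams-critical
receivers:
  - name: teams-default
    msteams_configs:
      - webhook_url: https://example.webhook.office.com/webhookb2/ops
  - name: teams-critical
    msteams_configs:
      - webhook_url: https://example.webhook.office.com/webhookb2/oncall
```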

Results

  • MTTR reduced by 40% — engineers can now identify issues across clusters from a single Grafana dashboard
  • Proactive incident management — alerting catches issues before they impact users
  • Standardized observability — every cluster follows the same monitoring patterns

Lessons Learned

  1. Start with dashboards, not alerts — understand your baseline before setting thresholds
  2. Label everything consistently — cluster name, namespace, and environment labels are non-negotiable
  3. Test your alerting pipeline end-to-end — a silent alert is worse than no alert
  4. Capacity plan for the central cluster — aggregating data from multiple clusters requires significant storage and compute
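Lesson 2 in practice: the cluster and environment labels can be stamped once, at the spoke, via `external_labels` on the remote-write components, so every series and log line arrives at the hub already tagged. A sketch with example label values (the component and endpoint names mirror the pipelines above):

```alloy
// Stamp every metric and log line with cluster/environment labels at the spoke.
prometheus.remote_write "central" {
  external_labels = {
    cluster     = "prod-eu-1",
    environment = "production",
  }
  endpoint {
    url = "https://prometheus-central.internal/api/v1/write"
  }
}

loki.write "central" {
  external_labels = {
    cluster     = "prod-eu-1",
    environment = "production",
  }
  endpoint {
    url = "https://loki-central.internal/loki/api/v1/push"
  }
}
```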

The centralized observability platform has become the backbone of our operational visibility, and the patterns established here are now our standard for any new cluster we deploy.