Yogita Sharma

Platform Engineer

AWS | Kubernetes | Terraform | CI/CD

Gurugram, India

Building a Centralized Observability Platform for Multi-Cluster Kubernetes

8 min read
Tags: Kubernetes, Prometheus, Grafana, Observability

The Challenge

When you’re managing multiple Kubernetes clusters in production, observability becomes a critical challenge. Each cluster generates its own metrics and logs, and without a unified view, troubleshooting turns into a game of jumping between dashboards and SSH sessions.

At Acefone, I was tasked with building a centralized observability platform that could aggregate data from all our clusters into a single pane of glass.

Architecture Overview

The solution I designed uses a hub-and-spoke model:

  • Spoke clusters: Each Kubernetes cluster runs Grafana Alloy as a DaemonSet, collecting metrics (via Prometheus remote write) and logs (via Loki’s push API)
  • Hub cluster: A dedicated monitoring cluster hosts the central Prometheus, Loki, and Grafana stack
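For the hub side of this model to work, the central Prometheus has to accept pushed samples. A minimal sketch of the relevant container spec fragment, assuming the spokes write directly to Prometheus (image tag and retention are illustrative; the `--web.enable-remote-write-receiver` flag, available since Prometheus v2.33, exposes the `/api/v1/write` endpoint; Loki's push API is enabled by default):

```yaml
# Hypothetical hub-cluster Prometheus container spec fragment
containers:
  - name: prometheus
    image: prom/prometheus:v2.53.0
    args:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d   # sized for multi-cluster ingest
      - --web.enable-remote-write-receiver  # accept pushes from spoke clusters
```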

Why Grafana Alloy?

Grafana Alloy (the successor to Grafana Agent) was chosen because:

  • Lightweight footprint compared to running full Prometheus instances on each cluster
  • Native support for Prometheus remote write and Loki push
  • Built-in service discovery for Kubernetes workloads
  • Pipeline-based configuration that’s easy to maintain as code

Key Implementation Details

Metrics Pipeline

```alloy
// Alloy configuration for metrics collection.
// Discover pods via the Kubernetes API so the scrape component has targets.
discovery.kubernetes "pods" {
  role = "pod"
}

prometheus.scrape "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.central.receiver]
}

prometheus.remote_write "central" {
  endpoint {
    url = "https://prometheus-central.internal/api/v1/write"
  }
}
```
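The log side of the same spoke pipeline follows the same shape. A sketch using Alloy's `loki.source.kubernetes` and `loki.write` components (the endpoint URL is illustrative):

```alloy
// Discover pods, tail their logs via the Kubernetes API,
// and push them to the hub cluster's Loki.
discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.central.receiver]
}

loki.write "central" {
  endpoint {
    url = "https://loki-central.internal/loki/api/v1/push"
  }
}
```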

Alerting with Teams Integration

I implemented production-grade alerting using Prometheus Alertmanager with Microsoft Teams webhook integration. Alerts are categorized by severity:

  • Critical: CPU > 90% for 5 minutes, Memory > 85% for 5 minutes
  • Warning: CPU > 75% for 10 minutes, pod restart count > 3 in 15 minutes
  • Info: Deployment rollout status changes
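As a sketch, the Critical CPU tier above could be expressed as a Prometheus alerting rule paired with an Alertmanager route to Teams. Metric names, thresholds wiring, and webhook URLs here are illustrative, and native `msteams_configs` support requires Alertmanager v0.26+:

```yaml
# rules.yml -- illustrative rule for the Critical severity tier
groups:
  - name: cluster-critical
    rules:
      - alert: HighCPUUsage
        expr: |
          100 * (1 - avg by (cluster) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU above 90% for 5 minutes on {{ $labels.cluster }}"

# alertmanager.yml -- route critical alerts to a dedicated Teams channel
route:
  receiver: teams-default
  routes:
    - matchers:
        - severity="critical"
      receiver: teams-critical
receivers:
  - name: teams-default
    msteams_configs:
      - webhook_url: https://example.webhook.office.com/webhookb2/ops
  - name: teams-critical
    msteams_configs:
      - webhook_url: https://example.webhook.office.com/webhookb2/oncall
```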

Results

  • MTTR reduced by 40% — engineers can now identify issues across clusters from a single Grafana dashboard
  • Proactive incident management — alerting catches issues before they impact users
  • Standardized observability — every cluster follows the same monitoring patterns

Lessons Learned

  1. Start with dashboards, not alerts — understand your baseline before setting thresholds
  2. Label everything consistently — cluster name, namespace, and environment labels are non-negotiable
  3. Test your alerting pipeline end-to-end — a silent alert is worse than no alert
  4. Capacity plan for the central cluster — aggregating data from multiple clusters requires significant storage and compute
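Lesson 2 in practice: the cluster and environment labels can be stamped once, at the spoke, via `external_labels` on the remote-write components, so every series and log line arrives at the hub already tagged. A sketch with example label values (the component and endpoint names mirror the pipelines above):

```alloy
// Stamp every metric and log line with cluster/environment labels at the spoke.
prometheus.remote_write "central" {
  external_labels = {
    cluster     = "prod-eu-1",
    environment = "production",
  }
  endpoint {
    url = "https://prometheus-central.internal/api/v1/write"
  }
}

loki.write "central" {
  external_labels = {
    cluster     = "prod-eu-1",
    environment = "production",
  }
  endpoint {
    url = "https://loki-central.internal/loki/api/v1/push"
  }
}
```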

The centralized observability platform has become the backbone of our operational visibility, and the patterns established here are now our standard for any new cluster we deploy.