Josh Menzies | Platform Engineer

How we implemented Datadog and OpenTelemetry across 200+ EKS services, cutting detection latency by 40% and MTTR by 25%. Learn how we standardized observability patterns, reduced alert fatigue, and enabled data-driven incident response.

Rolling out observability across a multi-account AWS organization with 200+ EKS services takes careful planning, standardization, and automation. This post covers how we used Datadog and OpenTelemetry to cut detection latency by 40% and MTTR by 25%.

The Challenge

With services spread across multiple AWS accounts and EKS clusters, we had fragmented observability. Teams used different monitoring tools, alerting was inconsistent, and incident response was slow. We needed a unified observability strategy that worked at scale.

The Solution

We standardized on Datadog for metrics, logs, and APM, and implemented OpenTelemetry for distributed tracing. Key components:

OpenTelemetry auto-instrumentation for all EKS services
Datadog agents deployed via DaemonSet across all clusters
Standardized dashboards and alerting rules
Automated WAF rules for security monitoring
Trace-based incident correlation
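
To make the last item concrete, here is a minimal sketch of trace-based correlation: alerts that carry the same trace ID are grouped so they can be triaged as one incident instead of several. It uses only the Python standard library. The `Alert` type, the function names, the service names, and the trace IDs are all hypothetical and chosen for illustration; this is not the Datadog or OpenTelemetry API, and our production pipeline is not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; field names are assumptions for this sketch."""
    service: str   # service that raised the alert
    trace_id: str  # trace ID propagated with the failing request
    message: str


def correlate_by_trace(alerts):
    """Group alerts that share a trace ID into one candidate incident."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert.trace_id].append(alert)
    return dict(incidents)


def services_in_incident(alerts):
    """Return the services involved in one incident, in first-seen order."""
    seen = []
    for alert in alerts:
        if alert.service not in seen:
            seen.append(alert.service)
    return seen


if __name__ == "__main__":
    # Illustrative data only: two alerts share a trace, one stands alone.
    sample = [
        Alert("checkout", "4bf92f3577b34da6a3ce929d0e0e4736", "HTTP 500 rate above threshold"),
        Alert("payments", "4bf92f3577b34da6a3ce929d0e0e4736", "upstream timeout"),
        Alert("search", "00f067aa0ba902b7a3ce929d0e0e4736", "p99 latency above threshold"),
    ]
    for trace_id, group in correlate_by_trace(sample).items():
        print(trace_id, services_in_incident(group))
```

Grouping on the trace ID rather than on service name or time window is what lets an on-call engineer see that a checkout error and a payments timeout belong to the same request path.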

The Impact

Detection latency dropped by 40%, MTTR fell by 25%, and alert fatigue fell by 60% through intelligent alerting. The unified observability platform now gives us real-time visibility across all services, which enables faster incident response and data-driven optimization.
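
For readers who want to track the same metrics, the sketch below shows one way to compute them from incident timestamps. It rests on assumptions this post does not spell out: detection latency is taken as the time from incident start to detection, and MTTR as the time from detection to resolution (teams define MTTR differently). The `Incident` type and function names are hypothetical, and any numbers you feed in are your own, not ours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime


def mean_detection_latency(incidents):
    """Mean seconds between incident start and detection."""
    return mean((i.detected_at - i.started_at).total_seconds() for i in incidents)


def mean_time_to_resolve(incidents):
    """Mean seconds between detection and resolution (assumed MTTR definition)."""
    return mean((i.resolved_at - i.detected_at).total_seconds() for i in incidents)


def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100 * (before - after) / before


if __name__ == "__main__":
    # Illustrative timestamps only, not data from our incidents.
    t0 = datetime(2024, 1, 1, 12, 0, 0)
    sample = [
        Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=70)),
        Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=50)),
    ]
    print(mean_detection_latency(sample), mean_time_to_resolve(sample))
```

Under these definitions, a 40% reduction means the new mean is 0.6 times the old one, so `percent_reduction(10, 6)` evaluates to 40.0.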

Observability at scale in a multi-account AWS org
