📊Management & Governance

Amazon CloudWatch

Monitoring and observability service for AWS resources and applications

CloudWatch is like having a dashboard in your car that shows speed, fuel level, engine temperature, and alerts you when something's wrong. For AWS, CloudWatch collects metrics (CPU usage, disk I/O, network traffic) from all your resources and displays them in dashboards. It can alert you when things go wrong: 'CPU above 80% for 5 minutes' triggers an SNS notification. CloudWatch also collects logs from applications, Lambda functions, and AWS services, making it your central monitoring hub. Think of it as the eyes and ears of your AWS infrastructure, constantly watching and ready to alert you to problems.

CloudWatch collects metrics (time-series data points) from AWS services automatically and custom metrics from applications via the PutMetricData API. Metrics have dimensions (key-value pairs that identify and filter a metric) and statistics (average, sum, minimum, maximum, sample count, and percentiles). CloudWatch Alarms trigger actions (SNS notifications, Auto Scaling policies, EC2 actions such as stop, reboot, or recover) when metrics breach thresholds. CloudWatch Logs collects log data from EC2, Lambda, CloudTrail, and custom sources, organized into log groups and log streams with per-group retention policies.
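To make the custom-metric path concrete, here is a minimal sketch of a PutMetricData request built as a plain dictionary in the shape boto3's `put_metric_data` accepts. The namespace `Shop/Checkout`, the metric `OrdersPlaced`, and the helper `build_metric_datum` are hypothetical names chosen for illustration; the actual API call is left as a comment because it needs AWS credentials.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None, high_resolution=False):
    """Build one entry for the MetricData list of a PutMetricData request."""
    dimensions = dimensions or {}
    return {
        "MetricName": name,
        # Dimensions are name/value pairs that identify this particular series.
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high resolution (1-second), 60 = standard resolution (1-minute).
        "StorageResolution": 1 if high_resolution else 60,
    }


request = {
    "Namespace": "Shop/Checkout",  # hypothetical custom namespace
    "MetricData": [
        build_metric_datum(
            "OrdersPlaced",  # hypothetical metric name
            3,
            dimensions={"Environment": "prod", "Region": "eu-west-1"},
        )
    ],
}

# With AWS credentials configured, the request could be sent with boto3:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**request)
```

Each distinct combination of dimension values is stored as a separate metric, so dimensions are best kept to a small set of low-cardinality values.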

Key Capabilities

Collects metrics (numerical time-series), logs (text streams), and traces (via X-Ray integration) from AWS services, EC2 instances, and custom application instrumentation
Alarms evaluate metric thresholds or anomaly detection bands and trigger actions including Auto Scaling policy changes, SNS notifications, and EC2 instance actions
Logs Insights provides an ad-hoc query language for interactive analysis of log data across log groups
Metric filters extract numeric metrics from log patterns, turning unstructured log output into actionable CloudWatch metrics
Container Insights and Lambda Insights provide pre-built metrics and dashboards for ECS, EKS, and Lambda workloads without custom instrumentation
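High-resolution custom metrics support 1-second granularity; standard-resolution metrics have 1-minute granularity. EC2 publishes its metrics at 5-minute intervals by default (basic monitoring) and at 1-minute intervals when detailed monitoring is enabled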
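As an illustration of how an alarm such as "CPU above 80% for 5 minutes" maps onto alarm settings, the sketch below builds the keyword arguments for boto3's `put_metric_alarm`. The helper `build_cpu_alarm`, the instance ID, and the SNS topic ARN are placeholders, not values from any real account; treat it as one possible configuration rather than a recommended one.

```python
def build_cpu_alarm(instance_id, topic_arn, threshold=80.0,
                    window_seconds=300, period_seconds=300):
    """Build put_metric_alarm arguments for 'average CPU above threshold for the window'."""
    if window_seconds % period_seconds != 0:
        raise ValueError("window_seconds must be a multiple of period_seconds")
    periods = window_seconds // period_seconds
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        # Every period in the window must breach before the alarm fires.
        "EvaluationPeriods": periods,
        "DatapointsToAlarm": periods,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic that notifies on-call
    }


# Placeholder identifiers for illustration only.
alarm = build_cpu_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:on-call",
)

# With AWS credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

With basic monitoring, a single 300-second period covers the 5-minute window. Passing `period_seconds=60` yields five 1-minute evaluation periods instead, which only makes sense when detailed monitoring is enabled on the instance.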

Gotchas & Constraints

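Gotcha #1: Detailed monitoring (1-minute intervals) costs extra; basic monitoring (5-minute intervals) is free for most services.
Gotcha #2: CloudWatch Logs storage costs add up. Log groups never expire by default, so set retention policies and export to S3 for long-term storage.
Constraints: Metrics are retained for up to 15 months, with older data points aggregated to coarser periods. A metric can have up to 30 dimensions. An alarm changes state only after its evaluation periods have elapsed, so expect a delay of a minute or two between a breach and the resulting action.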
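The 15-month retention figure applies only to data aggregated at 1-hour periods; finer-grained data points are kept for shorter windows. The sketch below encodes the retention schedule as AWS documents it at the time of writing (under 60 seconds: 3 hours; 60 seconds: 15 days; 300 seconds: 63 days; 3600 seconds: 455 days). The function name is hypothetical, and treating in-between periods as belonging to the next finer tier is a simplifying assumption.

```python
from datetime import timedelta


def retention_for_period(period_seconds):
    """Return how long data points of the given period stay queryable."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    if period_seconds < 60:        # high-resolution data points
        return timedelta(hours=3)
    if period_seconds < 300:       # 1-minute data points
        return timedelta(days=15)
    if period_seconds < 3600:      # 5-minute data points
        return timedelta(days=63)
    return timedelta(days=455)     # 1-hour data points, roughly 15 months


for period in (1, 60, 300, 3600):
    print(f"{period:>5}s period -> kept for {retention_for_period(period)}")
```

In practice this means a 1-minute view of an incident is only available for about two weeks, after which the same data can be queried at 5-minute and then 1-hour resolution.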

An e-commerce site runs on Auto Scaling EC2 instances behind an ALB. The team creates CloudWatch dashboards showing ALB request count, target response time, EC2 CPU and memory, and RDS connections. They set up alarms: if ALB 5xx errors exceed 10 in 5 minutes, send an SNS alert to the on-call engineer; if average CPU exceeds 70%, trigger Auto Scaling to add instances. They install the CloudWatch agent on the EC2 instances to stream application logs to CloudWatch Logs and to publish memory metrics, which EC2 does not report by default. They use CloudWatch Logs Insights to query logs: 'show all 500 errors in the last hour' or 'count requests by user agent.' During a production incident (database connection pool exhausted), CloudWatch Logs show the exact error messages, and CloudWatch metrics reveal that RDS connections were maxed out. They increase the connection pool size and create an alarm on the RDS DatabaseConnections metric so the condition is caught before it recurs.
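The two log queries in this scenario could look roughly like the following in the Logs Insights query language. This is a sketch under assumptions: the application is assumed to write JSON logs with `status` and `userAgent` fields, and the log group name `/shop/app` and the helper `build_start_query` are hypothetical. The dictionary matches the parameters of boto3's `start_query` on the `logs` client.

```python
import time

# "Show all 500 errors": assumes JSON log lines with a numeric `status` field.
ERRORS_QUERY = """
fields @timestamp, @message
| filter status = 500
| sort @timestamp desc
| limit 100
""".strip()

# "Count requests by user agent": assumes a `userAgent` field in each log line.
BY_USER_AGENT_QUERY = """
stats count(*) as requests by userAgent
| sort requests desc
""".strip()


def build_start_query(log_group, query, lookback_seconds=3600, now=None):
    """Build start_query arguments covering the last `lookback_seconds`."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - lookback_seconds,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


params = build_start_query("/shop/app", ERRORS_QUERY)

# With AWS credentials configured:
#   import boto3
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**params)["queryId"]
#   results = logs.get_query_results(queryId=query_id)  # poll until status is "Complete"
```

Queries run asynchronously, so the results call has to be polled until the query completes. Logs Insights charges by the amount of data scanned, which makes a narrow time range worth the effort.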

The Result

Proactive monitoring, faster incident response, and data-driven capacity planning.
