Amazon CloudWatch
Monitoring and observability service for AWS resources and applications
CloudWatch is like having a dashboard in your car that shows speed, fuel level, engine temperature, and alerts you when something's wrong. For AWS, CloudWatch collects metrics (CPU usage, disk I/O, network traffic) from all your resources and displays them in dashboards. It can alert you when things go wrong: 'CPU above 80% for 5 minutes' triggers an SNS notification. CloudWatch also collects logs from applications, Lambda functions, and AWS services, making it your central monitoring hub. Think of it as the eyes and ears of your AWS infrastructure, constantly watching and ready to alert you to problems.
CloudWatch collects metrics (time-series data points) from AWS services automatically, and custom metrics from applications via the PutMetricData API. Each metric lives in a namespace and is identified by its name plus its dimensions (key-value pairs used for filtering); it is read back as statistics (average, sum, min, max, sample count, and percentiles) over a chosen period. CloudWatch Alarms trigger actions (SNS notifications, Auto Scaling policies, EC2 actions) when a metric breaches a threshold. CloudWatch Logs collects log data from EC2, Lambda, CloudTrail, and custom sources, organized into log groups and log streams, with a retention policy set per log group.
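As a rough illustration of publishing a custom metric, the sketch below builds one data point in the shape PutMetricData expects and sends it with boto3. The metric name, namespace, and dimension values are hypothetical, and the publish step assumes boto3 is installed and AWS credentials are configured.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit, dimensions):
    """Build one MetricData entry in the shape PutMetricData expects."""
    return {
        "MetricName": name,
        # Dimensions are sent as a list of Name/Value pairs.
        "Dimensions": [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def publish(namespace, data):
    """Send data points to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


# Hypothetical metric for illustration only.
datum = build_metric_datum(
    "CheckoutLatency",
    182.0,
    "Milliseconds",
    {"Service": "checkout", "Environment": "prod"},
)
# publish("MyShop/Checkout", [datum])  # uncomment to actually send
```

Keeping the payload builder separate from the network call makes it easy to inspect or unit-test what would be sent before any request reaches AWS.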
Key Capabilities
Key features: CloudWatch Logs Insights (query logs with a purpose-built, pipe-delimited query language), CloudWatch Dashboards (visualize metrics), Amazon EventBridge, formerly CloudWatch Events (react to state changes), and CloudWatch Synthetics (monitor endpoints with canaries).
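To give a feel for log querying, here is a minimal sketch that assembles a Logs Insights query and the parameters for the StartQuery API. The log group name and the `status` field are assumptions; which fields exist depends on how your application formats its logs. The run step assumes boto3 and AWS credentials.

```python
import time


def errors_query(status_field="status", status=500, limit=20):
    """Build a Logs Insights query: a pipeline of commands joined by '|'."""
    return (
        "fields @timestamp, @message"
        f" | filter {status_field} = {status}"
        " | sort @timestamp desc"
        f" | limit {limit}"
    )


def start_query_params(log_group, query, now_epoch, lookback_seconds=3600):
    """Parameters for StartQuery; times are epoch seconds."""
    return {
        "logGroupName": log_group,
        "startTime": now_epoch - lookback_seconds,
        "endTime": now_epoch,
        "queryString": query,
    }


def run(params, poll_seconds=1.0):
    """Start the query and poll until it finishes (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builders above work without it

    logs = boto3.client("logs")
    query_id = logs.start_query(**params)["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response
        time.sleep(poll_seconds)


# Hypothetical log group; covers the last hour.
params = start_query_params("/app/web", errors_query(), now_epoch=int(time.time()))
# results = run(params)  # uncomment to actually query
```

Queries run asynchronously, which is why the sketch starts the query and then polls for results rather than expecting them from a single call.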
Gotchas & Constraints
Gotcha #1: For EC2, detailed monitoring (1-minute intervals) costs extra; basic monitoring (5-minute intervals) is free. Gotcha #2: CloudWatch Logs storage costs add up, and log groups never expire by default; set retention policies and export to S3 for long-term storage. Constraints: Metrics are retained for up to 15 months, but at decreasing resolution (1-minute data points for 15 days, 5-minute for 63 days, 1-hour for 15 months); a metric can have at most 30 dimensions; and alarm evaluation can take 1-2 minutes.
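Setting a retention policy is a single API call per log group, but the API only accepts specific day counts. The sketch below rounds a desired minimum up to the nearest accepted value. The list reflects the values the PutRetentionPolicy API is understood to accept at the time of writing, so check the current documentation before relying on it; the log group name is hypothetical.

```python
# Day counts believed to be accepted by PutRetentionPolicy (verify against docs).
ALLOWED_RETENTION_DAYS = (
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
)


def nearest_retention(min_days):
    """Return the smallest accepted retention that keeps logs at least min_days."""
    for days in ALLOWED_RETENTION_DAYS:
        if days >= min_days:
            return days
    raise ValueError(f"no accepted retention covers {min_days} days")


def retention_params(log_group, min_days):
    """Parameters for the CloudWatch Logs PutRetentionPolicy call."""
    return {"logGroupName": log_group, "retentionInDays": nearest_retention(min_days)}


def apply_retention(params):
    """Apply the policy (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    boto3.client("logs").put_retention_policy(**params)


# Hypothetical log group: keep at least 45 days, which rounds up to 60.
policy = retention_params("/app/web", 45)
# apply_retention(policy)  # uncomment to actually apply
```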
An e-commerce site runs on Auto Scaling EC2 instances behind an ALB. The team creates CloudWatch dashboards showing ALB request count, target response time, EC2 CPU and memory (memory is not reported by EC2 itself and requires the CloudWatch agent), and RDS connections. They set up alarms: if ALB 5xx errors exceed 10 in 5 minutes, send an SNS alert to the on-call engineer; if average CPU exceeds 70%, trigger Auto Scaling to add instances. For application logs, they install the CloudWatch agent on the EC2 instances to stream logs to CloudWatch Logs. They use CloudWatch Logs Insights to query the logs: 'show all 500 errors in the last hour' or 'count requests by user agent.' During a production incident (database connection pool exhausted), CloudWatch Logs show the exact error messages, and CloudWatch metrics reveal that RDS connections were maxed out. The team increases the connection pool size and creates a new alarm on the RDS connection count so the condition is caught early next time.
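The first alarm in this scenario (more than 10 ALB 5xx errors in 5 minutes, notify via SNS) could be expressed roughly as follows. This is a sketch, not the team's actual configuration: the alarm name, load balancer identifier, and topic ARN are placeholders, and choosing `HTTPCode_ELB_5XX_Count` over the target-side 5xx metric is an assumption.

```python
def alb_5xx_alarm(load_balancer, topic_arn):
    """Parameters for PutMetricAlarm: more than 10 ALB 5xx errors in 5 minutes."""
    return {
        "AlarmName": "alb-5xx-errors",  # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_ELB_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",  # total errors in the period
        "Period": 300,  # one 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": 10,
        "ComparisonOperator": "GreaterThanThreshold",
        # No data points usually means no errors, so do not alarm on gaps.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


def create_alarm(params):
    """Create or update the alarm (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder identifiers for illustration only.
alarm = alb_5xx_alarm(
    "app/shop-alb/0123456789abcdef",
    "arn:aws:sns:us-east-1:123456789012:on-call",
)
# create_alarm(alarm)  # uncomment to actually create the alarm
```

The follow-up alarm on RDS connections would take the same shape with the `AWS/RDS` namespace and the `DatabaseConnections` metric, with a threshold chosen relative to the database's connection limit.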
The Result
Proactive monitoring, faster incident response, and data-driven capacity planning.