Analytics & Business Intelligence

    Amazon EMR

    Managed Hadoop framework for processing big data workloads

    EMR is like renting a supercomputer cluster for big data processing. When you have petabytes of data to analyze (far too much for a single server), you need distributed computing. EMR runs Hadoop, Spark, Presto, and other big data frameworks on clusters of EC2 instances. You submit a job (e.g., "process this 10TB dataset"), EMR spins up a cluster, processes the data in parallel across hundreds of nodes, and shuts down when done. It's like having a temporary army of workers who show up, complete the job, and leave; you only pay for the time they work.

    EMR creates clusters with a master node (coordinates the cluster), core nodes (run tasks and store HDFS data), and task nodes (run tasks only). You choose instance types, cluster size, and applications (Hadoop, Spark, Hive, Presto, HBase). EMR supports transient clusters (terminate after the job completes) and long-running clusters.
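    The cluster layout above can be sketched as a boto3 run_job_flow request. This is a minimal illustration, not a production configuration: the cluster name, instance types, counts, log bucket, and EMR release label are all assumptions.

    ```python
    # Sketch: a transient EMR cluster request mirroring the node roles above —
    # one MASTER group, CORE nodes (HDFS), and TASK nodes (compute only).
    # All names, sizes, and the release label are hypothetical.
    cluster_config = {
        "Name": "transient-analytics-cluster",        # hypothetical name
        "ReleaseLabel": "emr-6.15.0",                 # assumed EMR release
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 4},
                {"Name": "Task", "InstanceRole": "TASK",
                 "InstanceType": "m5.xlarge", "InstanceCount": 8},
            ],
            # Transient cluster: terminate once the last step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "LogUri": "s3://my-bucket/emr-logs/",         # hypothetical bucket
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

    # With AWS credentials configured, the launch call would be:
    # import boto3
    # emr = boto3.client("emr")
    # response = emr.run_job_flow(**cluster_config)
    ```

    Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what makes the cluster transient; a long-running cluster would set it to `True`.
    
    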

    Key Capabilities

    Key features: EMR Notebooks (Jupyter for interactive analysis), EMR Serverless (no cluster management), and integration with S3 (use S3 instead of HDFS for storage).
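    For the EMR Serverless option mentioned above, a job is submitted directly with no instance groups to size. A minimal sketch of a start_job_run request, where the application ID, IAM role ARN, and S3 paths are placeholders:

    ```python
    # Sketch: running a Spark job via EMR Serverless — no cluster to manage.
    # The application ID, role ARN, and S3 entry point are hypothetical.
    serverless_job = {
        "applicationId": "00example-app-id",          # hypothetical app ID
        "executionRoleArn": "arn:aws:iam::123456789012:role/EMRServerlessRole",
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": "s3://my-bucket/jobs/process.py",  # hypothetical
                "sparkSubmitParameters": "--conf spark.executor.memory=4g",
            }
        },
    }

    # With AWS credentials configured:
    # import boto3
    # emrs = boto3.client("emr-serverless")
    # response = emrs.start_job_run(**serverless_job)
    ```

    Note the S3 entry point: consistent with the S3-over-HDFS integration above, both the job script and its data live in S3 rather than on cluster-local storage.
    
    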

    Gotchas & Constraints

    Gotcha #1: Core nodes store HDFS data; terminating them causes data loss. Use S3 for persistent storage. Gotcha #2: EMR charges EC2 instance costs plus EMR service fees; costs can be high for large clusters. Constraints: Maximum 2,000 instances per instance group (adjustable), cluster launch takes 5-10 minutes, and spot instances can be reclaimed with little warning (run spot capacity as task nodes, which hold no HDFS data).
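    The spot guidance above reduces to one rule: anything holding HDFS blocks stays on-demand, and interruptible spot capacity goes in task groups. A sketch of instance groups encoding that rule (sizes and bid price are hypothetical):

    ```python
    # Sketch: CORE nodes on-demand (they hold HDFS blocks, so losing them
    # loses data); TASK nodes on spot (stateless, safe to reclaim).
    # Instance counts and the bid price are hypothetical.
    instance_groups = [
        {"Name": "Core", "InstanceRole": "CORE",
         "Market": "ON_DEMAND",           # reliability for HDFS-bearing nodes
         "InstanceType": "r5.4xlarge", "InstanceCount": 10},
        {"Name": "SpotTasks", "InstanceRole": "TASK",
         "Market": "SPOT",                # interruptible, compute-only
         "BidPrice": "0.60",              # hypothetical max USD/hour
         "InstanceType": "r5.4xlarge", "InstanceCount": 40},
    ]

    # The rule encoded above: every SPOT group must be a TASK group.
    spot_groups = [g for g in instance_groups if g.get("Market") == "SPOT"]
    assert all(g["InstanceRole"] == "TASK" for g in spot_groups)
    ```
    
    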

    A genomics research lab processes DNA sequencing data; each sample generates 100GB of raw data, and they process 1,000 samples/month (100TB total). Running this on a single server would take months. They use EMR with Spark: upload raw data to S3, launch a transient EMR cluster (100 r5.4xlarge instances), run Spark jobs to align sequences and identify variants, write results to S3, and terminate the cluster. Processing completes in 4 hours instead of 4 months. They use spot instances for task nodes (70% cost savings) and on-demand for core nodes (reliability). For interactive analysis, they use EMR Notebooks to query processed data with Spark SQL. When they need to reprocess data with updated algorithms, they launch a new cluster with the same configuration.
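    The 70% spot savings in the scenario above can be made concrete with back-of-the-envelope arithmetic. The hourly rate and the core/task split below are assumed placeholders, not published AWS prices or figures from the scenario:

    ```python
    # Illustrative cost comparison for a 100-node, 4-hour run.
    # The on-demand rate and node split are assumptions for this sketch.
    ON_DEMAND_RATE = 1.008   # assumed USD/hour for r5.4xlarge
    SPOT_DISCOUNT = 0.70     # ~70% savings on spot, as cited above
    HOURS = 4

    core_nodes, task_nodes = 20, 80   # hypothetical split of the 100 nodes

    core_cost = core_nodes * ON_DEMAND_RATE * HOURS
    task_cost = task_nodes * ON_DEMAND_RATE * (1 - SPOT_DISCOUNT) * HOURS
    mixed_total = core_cost + task_cost
    all_on_demand = (core_nodes + task_nodes) * ON_DEMAND_RATE * HOURS

    print(f"mixed fleet: ${mixed_total:.2f}  all on-demand: ${all_on_demand:.2f}")
    ```

    Under these assumptions the mixed fleet costs roughly 44% of the all-on-demand run, which is why the lab keeps only the HDFS-bearing core nodes on-demand.
    
    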

    The Result

    100x faster processing, pay-per-use pricing, and scalable to any data volume.

    Official AWS Documentation