Analytics & Business Intelligence

    Amazon EMR

    Managed Hadoop framework for processing big data workloads

    EMR is like renting a supercomputer cluster for big data processing. When you have petabytes of data to analyze (far too much for a single server), you need distributed computing. EMR runs Hadoop, Spark, Presto, and other big data frameworks on clusters of EC2 instances. You submit a job (e.g., "process this 10TB dataset"), EMR spins up a cluster, processes the data in parallel across hundreds of nodes, and shuts down when done. It's like having a temporary army of workers who show up, complete the job, and leave; you only pay for the time they work.

    EMR creates clusters with a master node (coordinates the cluster), core nodes (run tasks and store HDFS data), and task nodes (run tasks only). You choose instance types, cluster size, and applications (Hadoop, Spark, Hive, Presto, HBase). EMR supports transient clusters (terminate after the job completes) and long-running clusters.
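    The cluster layout above can be sketched as a boto3 run_job_flow request. This is a minimal illustration, not a production configuration: the cluster name, instance types, counts, log bucket, and EMR release label are all assumptions.

    ```python
    # Sketch: a transient EMR cluster request mirroring the node roles above —
    # one MASTER group, CORE nodes (HDFS), and TASK nodes (compute only).
    # All names, sizes, and the release label are hypothetical.
    cluster_config = {
        "Name": "transient-analytics-cluster",        # hypothetical name
        "ReleaseLabel": "emr-6.15.0",                 # assumed EMR release
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 4},
                {"Name": "Task", "InstanceRole": "TASK",
                 "InstanceType": "m5.xlarge", "InstanceCount": 8},
            ],
            # Transient cluster: terminate once the last step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "LogUri": "s3://my-bucket/emr-logs/",         # hypothetical bucket
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

    # With AWS credentials configured, the launch call would be:
    # import boto3
    # emr = boto3.client("emr")
    # response = emr.run_job_flow(**cluster_config)
    ```

    Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what makes the cluster transient; a long-running cluster would set it to `True`.
    
    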

    Key Capabilities

    Key features: EMR Notebooks (Jupyter for interactive analysis), EMR Serverless (no cluster management), and integration with S3 (use S3 instead of HDFS for storage).
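    For the EMR Serverless option mentioned above, a job is submitted directly with no instance groups to size. A minimal sketch of a start_job_run request, where the application ID, IAM role ARN, and S3 paths are placeholders:

    ```python
    # Sketch: running a Spark job via EMR Serverless — no cluster to manage.
    # The application ID, role ARN, and S3 entry point are hypothetical.
    serverless_job = {
        "applicationId": "00example-app-id",          # hypothetical app ID
        "executionRoleArn": "arn:aws:iam::123456789012:role/EMRServerlessRole",
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": "s3://my-bucket/jobs/process.py",  # hypothetical
                "sparkSubmitParameters": "--conf spark.executor.memory=4g",
            }
        },
    }

    # With AWS credentials configured:
    # import boto3
    # emrs = boto3.client("emr-serverless")
    # response = emrs.start_job_run(**serverless_job)
    ```

    Note the S3 entry point: consistent with the S3-over-HDFS integration above, both the job script and its data live in S3 rather than on cluster-local storage.
    
    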

    Gotchas & Constraints

    Gotcha #1: Core nodes store HDFS data; terminating them causes data loss. Use S3 for persistent storage. Gotcha #2: EMR charges EC2 instance costs plus EMR service fees; costs can be high for large clusters. Constraints: Maximum 2,000 instances per instance group (adjustable), cluster launch takes 5-10 minutes, and spot instances can be reclaimed with little warning (run spot capacity as task nodes, which hold no HDFS data).
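    The spot guidance above reduces to one rule: anything holding HDFS blocks stays on-demand, and interruptible spot capacity goes in task groups. A sketch of instance groups encoding that rule (sizes and bid price are hypothetical):

    ```python
    # Sketch: CORE nodes on-demand (they hold HDFS blocks, so losing them
    # loses data); TASK nodes on spot (stateless, safe to reclaim).
    # Instance counts and the bid price are hypothetical.
    instance_groups = [
        {"Name": "Core", "InstanceRole": "CORE",
         "Market": "ON_DEMAND",           # reliability for HDFS-bearing nodes
         "InstanceType": "r5.4xlarge", "InstanceCount": 10},
        {"Name": "SpotTasks", "InstanceRole": "TASK",
         "Market": "SPOT",                # interruptible, compute-only
         "BidPrice": "0.60",              # hypothetical max USD/hour
         "InstanceType": "r5.4xlarge", "InstanceCount": 40},
    ]

    # The rule encoded above: every SPOT group must be a TASK group.
    spot_groups = [g for g in instance_groups if g.get("Market") == "SPOT"]
    assert all(g["InstanceRole"] == "TASK" for g in spot_groups)
    ```
    
    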

    A genomics research lab processes DNA sequencing data; each sample generates 100GB of raw data, and they process 1,000 samples/month (100TB total). Running this on a single server would take months. They use EMR with Spark: upload raw data to S3, launch a transient EMR cluster (100 r5.4xlarge instances), run Spark jobs to align sequences and identify variants, write results to S3, and terminate the cluster. Processing completes in 4 hours instead of 4 months. They use spot instances for task nodes (70% cost savings) and on-demand for core nodes (reliability). For interactive analysis, they use EMR Notebooks to query processed data with Spark SQL. When they need to reprocess data with updated algorithms, they launch a new cluster with the same configuration.
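    The 70% spot savings in the scenario above can be made concrete with back-of-the-envelope arithmetic. The hourly rate and the core/task split below are assumed placeholders, not published AWS prices or figures from the scenario:

    ```python
    # Illustrative cost comparison for a 100-node, 4-hour run.
    # The on-demand rate and node split are assumptions for this sketch.
    ON_DEMAND_RATE = 1.008   # assumed USD/hour for r5.4xlarge
    SPOT_DISCOUNT = 0.70     # ~70% savings on spot, as cited above
    HOURS = 4

    core_nodes, task_nodes = 20, 80   # hypothetical split of the 100 nodes

    core_cost = core_nodes * ON_DEMAND_RATE * HOURS
    task_cost = task_nodes * ON_DEMAND_RATE * (1 - SPOT_DISCOUNT) * HOURS
    mixed_total = core_cost + task_cost
    all_on_demand = (core_nodes + task_nodes) * ON_DEMAND_RATE * HOURS

    print(f"mixed fleet: ${mixed_total:.2f}  all on-demand: ${all_on_demand:.2f}")
    ```

    Under these assumptions the mixed fleet costs roughly 44% of the all-on-demand run, which is why the lab keeps only the HDFS-bearing core nodes on-demand.
    
    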

    The Result

    100x faster processing, pay-per-use pricing, and scalable to any data volume.

    Official AWS Documentation