AWS Glue
Fully managed ETL service for data preparation and loading
Glue is like a data janitor that cleans, transforms, and organizes your data automatically. Your data is scattered across S3, RDS, and Redshift in different formats, with different schemas, messy and inconsistent. Glue crawls your data sources, figures out the schemas automatically, and builds a catalog. Then it runs ETL (Extract, Transform, Load) jobs to clean and transform the data: convert CSV to Parquet, join tables, filter records, aggregate values, and load the result into your data warehouse or data lake. It's serverless, so you don't manage servers, and it scales automatically.
Glue has three main components: the Glue Data Catalog (metadata repository), Glue Crawlers (discover schema), and Glue ETL Jobs (transform data). Crawlers scan data sources (S3, RDS, Redshift) and populate the Data Catalog with table definitions. ETL jobs are written in Python or Scala (using Spark), or you can use Glue Studio (visual ETL). Jobs run on serverless Spark clusters; you specify DPUs (Data Processing Units; each DPU provides 4 vCPUs and 16 GB of memory) for capacity.
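To make the crawler idea concrete, here is a minimal pure-Python sketch of what schema discovery does conceptually: sample records, infer a column-to-type mapping, and register it as a table in a catalog. This is illustrative only (the function and table names are made up); Glue's real classifiers handle many formats and edge cases.

```python
# Conceptual sketch of a crawler: sample records, infer a schema,
# and register it in a catalog. Names are illustrative, not Glue's API.

def infer_type(value):
    """Map a sample value to a simple catalog type."""
    if isinstance(value, bool):      # check bool before int (bool is an int subclass)
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"

def crawl(records):
    """Infer a schema from sample records, widening to string on conflicts."""
    schema = {}
    for record in records:
        for column, value in record.items():
            inferred = infer_type(value)
            if schema.get(column, inferred) != inferred:
                inferred = "string"  # conflicting types across records
            schema[column] = inferred
    return schema

# "Catalog" here is just a dict of table name -> schema.
catalog = {"clickstream_logs": crawl([
    {"customer_id": 42, "page": "/home", "duration_ms": 812.5},
    {"customer_id": 7, "page": "/cart", "duration_ms": 93.0},
])}
print(catalog)
# {'clickstream_logs': {'customer_id': 'bigint', 'page': 'string', 'duration_ms': 'double'}}
```

Once the catalog holds the table definition, downstream tools (ETL jobs, Athena) read the schema from it instead of re-deriving it from the raw files.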
Key Capabilities
Key features: job bookmarks (track already-processed data so reruns handle only new data), triggers (schedule- or event-based job starts), and development endpoints (develop and test job scripts interactively).
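The job-bookmark idea is simple to sketch: persist a marker of what a job has already processed so the next run picks up only new arrivals. A hypothetical pure-Python stand-in (Glue actually tracks this state server-side, per job):

```python
# Hypothetical sketch of a job bookmark: remember which input files a job
# has processed so the next run only handles new arrivals. Not Glue's API.

class JobBookmark:
    def __init__(self):
        self.processed = set()

    def new_files(self, available):
        """Return only files not seen in a previous committed run."""
        return sorted(f for f in available if f not in self.processed)

    def commit(self, files):
        """Record files as processed once the job run succeeds."""
        self.processed.update(files)

bookmark = JobBookmark()

# First nightly run: everything is new.
run1 = bookmark.new_files(["s3://logs/day1.json", "s3://logs/day2.json"])
bookmark.commit(run1)

# Second run: only the newly arrived file is returned.
run2 = bookmark.new_files(["s3://logs/day1.json", "s3://logs/day2.json",
                           "s3://logs/day3.json"])
print(run2)  # ['s3://logs/day3.json']
```

Committing only after a successful run is the important design point: a failed run leaves the bookmark untouched, so the same data is retried next time.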
Gotchas & Constraints
Gotcha #1: Glue charges per DPU-hour and per crawler run; costs can add up for frequent crawls or long-running jobs. Gotcha #2: Glue ETL uses Spark; learning curve for developers unfamiliar with Spark. Constraints: Maximum 2,000 concurrent job runs per account (adjustable), maximum 10 million partitions per table, and job timeout maximum 48 hours.
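A quick back-of-envelope helps make the DPU-hour gotcha concrete. The $0.44 per DPU-hour rate below is an assumption for illustration (a commonly published Glue ETL rate; check current regional pricing):

```python
# Back-of-envelope Glue ETL cost estimate. Jobs bill per DPU-hour.
# The rate is an assumed illustrative figure; verify current pricing.

def glue_job_cost(dpus, hours, rate_per_dpu_hour=0.44):
    """Estimated cost of one job run in dollars."""
    return dpus * hours * rate_per_dpu_hour

# Example: a nightly job using 50 DPUs for 2 hours.
nightly = glue_job_cost(dpus=50, hours=2)
monthly = nightly * 30
print(f"per run: ${nightly:.2f}, per month: ${monthly:.2f}")
# per run: $44.00, per month: $1320.00
```

At these assumed rates, a single nightly job run is modest, but a month of them is four figures, which is why crawl frequency and job runtime deserve attention.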
A retail company has data in multiple sources: sales transactions in RDS, clickstream logs in S3 (JSON), and inventory data in Redshift. They need to combine this data for analytics. They create Glue crawlers for each source; crawlers discover schemas and populate the Glue Data Catalog. They create a Glue ETL job: read sales from RDS, read clickstreams from S3, join them on customer_id, aggregate by product and date, convert to Parquet format, and write to S3 data lake. The job runs nightly, processing 10TB of data in 2 hours using 50 DPUs. Athena queries the Parquet files in S3 (using Glue Data Catalog for schema), and QuickSight visualizes the data. When they add a new data source (customer reviews in DynamoDB), they create a new crawler and update the ETL job; no infrastructure changes needed.
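The join-and-aggregate step of that nightly job can be sketched in plain Python to show the data shape. In the real job this would be PySpark on Glue; the sample rows and column names here are made up for illustration:

```python
# Plain-Python sketch of the scenario's transform: join sales to clickstream
# on customer_id, then aggregate revenue by (product, date). A real Glue job
# would do this with Spark; rows and columns here are illustrative.
from collections import defaultdict

sales = [  # from RDS
    {"customer_id": 1, "product": "shoes", "date": "2024-06-01", "amount": 80.0},
    {"customer_id": 2, "product": "shoes", "date": "2024-06-01", "amount": 60.0},
    {"customer_id": 1, "product": "hat",   "date": "2024-06-02", "amount": 25.0},
]
clicks = [  # from S3 JSON logs
    {"customer_id": 1, "page": "/product/shoes"},
    {"customer_id": 2, "page": "/checkout"},
]

# Inner join on customer_id.
clicks_by_customer = defaultdict(list)
for c in clicks:
    clicks_by_customer[c["customer_id"]].append(c)

joined = [{**s, "page": c["page"]}
          for s in sales
          for c in clicks_by_customer.get(s["customer_id"], [])]

# Aggregate revenue by (product, date) -- the shape written out as Parquet.
revenue = defaultdict(float)
for row in joined:
    revenue[(row["product"], row["date"])] += row["amount"]

print(dict(revenue))
# {('shoes', '2024-06-01'): 140.0, ('hat', '2024-06-02'): 25.0}
```

Writing the aggregated result as Parquet (columnar, compressed) rather than raw JSON/CSV is what makes the downstream Athena queries fast and cheap.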
The Result
A unified data lake, automated ETL, and fast analytics queries.