AWS Glue
Fully managed ETL service for data preparation and loading
Glue is like a data janitor that cleans, transforms, and organizes your data automatically. Your data is scattered across S3, RDS, and Redshift in different formats, with different schemas, messy and inconsistent. Glue crawls your data sources, figures out the schemas automatically, and builds a catalog. Then it runs ETL (Extract, Transform, Load) jobs to clean and transform the data: convert CSV to Parquet, join tables, filter records, aggregate values, and load the result into your data warehouse or data lake. It's serverless, so you don't manage servers, and it scales automatically.
Glue has three main components: the Glue Data Catalog (metadata repository), Glue Crawlers (discover schema), and Glue ETL Jobs (transform data). Crawlers scan data sources (S3, RDS, Redshift) and populate the Data Catalog with table definitions. ETL jobs are written in Python or Scala (using Spark), or you can use Glue Studio (visual ETL). Jobs run on serverless Spark clusters; you specify DPUs (Data Processing Units; each DPU provides 4 vCPUs and 16 GB of memory) for capacity.
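To make the crawler idea concrete, here is a minimal pure-Python sketch of what schema discovery does conceptually: sample records, infer a column-to-type mapping, and register it as a table in a catalog. This is illustrative only (the function and table names are made up); Glue's real classifiers handle many formats and edge cases.

```python
# Conceptual sketch of a crawler: sample records, infer a schema,
# and register it in a catalog. Names are illustrative, not Glue's API.

def infer_type(value):
    """Map a sample value to a simple catalog type."""
    if isinstance(value, bool):      # check bool before int (bool is an int subclass)
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"

def crawl(records):
    """Infer a schema from sample records, widening to string on conflicts."""
    schema = {}
    for record in records:
        for column, value in record.items():
            inferred = infer_type(value)
            if schema.get(column, inferred) != inferred:
                inferred = "string"  # conflicting types across records
            schema[column] = inferred
    return schema

# "Catalog" here is just a dict of table name -> schema.
catalog = {"clickstream_logs": crawl([
    {"customer_id": 42, "page": "/home", "duration_ms": 812.5},
    {"customer_id": 7, "page": "/cart", "duration_ms": 93.0},
])}
print(catalog)
# {'clickstream_logs': {'customer_id': 'bigint', 'page': 'string', 'duration_ms': 'double'}}
```

Once the catalog holds the table definition, downstream tools (ETL jobs, Athena) read the schema from it instead of re-deriving it from the raw files.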
Key Capabilities
Key features: job bookmarks (track already-processed data so reruns handle only new data), triggers (schedule- or event-based job starts), and development endpoints (develop and test job scripts interactively).
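The job-bookmark idea is simple to sketch: persist a marker of what a job has already processed so the next run picks up only new arrivals. A hypothetical pure-Python stand-in (Glue actually tracks this state server-side, per job):

```python
# Hypothetical sketch of a job bookmark: remember which input files a job
# has processed so the next run only handles new arrivals. Not Glue's API.

class JobBookmark:
    def __init__(self):
        self.processed = set()

    def new_files(self, available):
        """Return only files not seen in a previous committed run."""
        return sorted(f for f in available if f not in self.processed)

    def commit(self, files):
        """Record files as processed once the job run succeeds."""
        self.processed.update(files)

bookmark = JobBookmark()

# First nightly run: everything is new.
run1 = bookmark.new_files(["s3://logs/day1.json", "s3://logs/day2.json"])
bookmark.commit(run1)

# Second run: only the newly arrived file is returned.
run2 = bookmark.new_files(["s3://logs/day1.json", "s3://logs/day2.json",
                           "s3://logs/day3.json"])
print(run2)  # ['s3://logs/day3.json']
```

Committing only after a successful run is the important design point: a failed run leaves the bookmark untouched, so the same data is retried next time.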
Gotchas & Constraints
Gotcha #1: Glue charges per DPU-hour and per crawler run; costs can add up for frequent crawls or long-running jobs. Gotcha #2: Glue ETL uses Spark; learning curve for developers unfamiliar with Spark. Constraints: Maximum 2,000 concurrent job runs per account (adjustable), maximum 10 million partitions per table, and job timeout maximum 48 hours.
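A quick back-of-envelope helps make the DPU-hour gotcha concrete. The $0.44 per DPU-hour rate below is an assumption for illustration (a commonly published Glue ETL rate; check current regional pricing):

```python
# Back-of-envelope Glue ETL cost estimate. Jobs bill per DPU-hour.
# The rate is an assumed illustrative figure; verify current pricing.

def glue_job_cost(dpus, hours, rate_per_dpu_hour=0.44):
    """Estimated cost of one job run in dollars."""
    return dpus * hours * rate_per_dpu_hour

# Example: a nightly job using 50 DPUs for 2 hours.
nightly = glue_job_cost(dpus=50, hours=2)
monthly = nightly * 30
print(f"per run: ${nightly:.2f}, per month: ${monthly:.2f}")
# per run: $44.00, per month: $1320.00
```

At these assumed rates, a single nightly job run is modest, but a month of them is four figures, which is why crawl frequency and job runtime deserve attention.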
A retail company has data in multiple sources: sales transactions in RDS, clickstream logs in S3 (JSON), and inventory data in Redshift. They need to combine this data for analytics. They create Glue crawlers for each source; crawlers discover schemas and populate the Glue Data Catalog. They create a Glue ETL job: read sales from RDS, read clickstreams from S3, join them on customer_id, aggregate by product and date, convert to Parquet format, and write to S3 data lake. The job runs nightly, processing 10TB of data in 2 hours using 50 DPUs. Athena queries the Parquet files in S3 (using Glue Data Catalog for schema), and QuickSight visualizes the data. When they add a new data source (customer reviews in DynamoDB), they create a new crawler and update the ETL job; no infrastructure changes needed.
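The join-and-aggregate step of that nightly job can be sketched in plain Python to show the data shape. In the real job this would be PySpark on Glue; the sample rows and column names here are made up for illustration:

```python
# Plain-Python sketch of the scenario's transform: join sales to clickstream
# on customer_id, then aggregate revenue by (product, date). A real Glue job
# would do this with Spark; rows and columns here are illustrative.
from collections import defaultdict

sales = [  # from RDS
    {"customer_id": 1, "product": "shoes", "date": "2024-06-01", "amount": 80.0},
    {"customer_id": 2, "product": "shoes", "date": "2024-06-01", "amount": 60.0},
    {"customer_id": 1, "product": "hat",   "date": "2024-06-02", "amount": 25.0},
]
clicks = [  # from S3 JSON logs
    {"customer_id": 1, "page": "/product/shoes"},
    {"customer_id": 2, "page": "/checkout"},
]

# Inner join on customer_id.
clicks_by_customer = defaultdict(list)
for c in clicks:
    clicks_by_customer[c["customer_id"]].append(c)

joined = [{**s, "page": c["page"]}
          for s in sales
          for c in clicks_by_customer.get(s["customer_id"], [])]

# Aggregate revenue by (product, date) -- the shape written out as Parquet.
revenue = defaultdict(float)
for row in joined:
    revenue[(row["product"], row["date"])] += row["amount"]

print(dict(revenue))
# {('shoes', '2024-06-01'): 140.0, ('hat', '2024-06-02'): 25.0}
```

Writing the aggregated result as Parquet (columnar, compressed) rather than raw JSON/CSV is what makes the downstream Athena queries fast and cheap.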
The Result
A unified data lake, automated ETL, and fast analytics queries.