📈Analytics & Business Intelligence

Amazon Athena

Query data in S3 using standard SQL without infrastructure

Athena is like having a SQL database, except your data stays in S3 and you don't manage any servers. Suppose you have terabytes of data in S3 (logs, CSV files, Parquet files) and you want to query it with SQL. Athena runs SQL queries directly on that data: no loading, no ETL, no infrastructure. You pay only for the data your queries scan. It's like having a librarian who can search through millions of books without moving them off the shelves. It is a good fit for ad-hoc analysis and interactive queries on data lakes.

Athena is a serverless query service built on the Presto/Trino engine. You define tables in the Glue Data Catalog (or Athena's internal catalog) that point to S3 locations. At query time, Athena reads the data from S3, executes the SQL, and returns the results. It supports multiple formats, including CSV, JSON, Parquet, ORC, and Avro.
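As a rough illustration of that flow, the Python sketch below builds the DDL for an external table over an S3 prefix and submits a statement through boto3. The database, table, column, and bucket names are hypothetical, and running a real query assumes boto3 is installed and AWS credentials are configured. A Glue crawler is another way to create the table definition.

```python
def build_create_table_ddl(database, table, s3_location):
    """Return DDL for a partitioned, Parquet-backed external table.

    The three data columns are a hypothetical log schema used for illustration.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        "  request_id string,\n"
        "  endpoint   string,\n"
        "  status     int\n"
        ")\n"
        "PARTITIONED BY (year string, month string, day string)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )


def start_query(sql, database, output_location, client=None):
    """Submit a statement to Athena and return its query execution ID."""
    if client is None:
        import boto3  # imported lazily so the DDL helper works without boto3
        client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical names; no AWS call is made by building the DDL string.
ddl = build_create_table_ddl("logs_db", "app_logs", "s3://example-log-bucket/parquet/")
print(ddl)
```

Passing the resulting string to start_query would register the table. Nothing is copied or loaded, because the table is only metadata that points at the S3 prefix.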

Key Capabilities

Runs standard SQL queries directly against data in S3 (CSV, JSON, Parquet, ORC, Avro, and more) using a Presto/Trino engine with no infrastructure to provision
Charged per TB of data scanned; storing data in columnar formats (Parquet, ORC) and partitioning by common query dimensions can cut scanned data and cost by up to 90%
Tables are defined in the Glue Data Catalog, enabling schema sharing with EMR, Glue ETL jobs, and Redshift Spectrum
Federated queries extend Athena to query RDS, DynamoDB, Redshift, and other data sources alongside S3 in a single SQL statement
CTAS (Create Table As Select) writes query results back to S3 as a new table, supporting incremental ETL and result materialization
Workgroups enforce per-team query data scan limits and costs, and can require result encryption for compliance
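To make the CTAS and columnar-format points concrete, here is a minimal Python sketch that assembles a CTAS statement writing partitioned Parquet back to S3. The table names, column names, and S3 path are hypothetical. The WITH properties shown (format, external_location, partitioned_by) are the commonly used ones, and other options are omitted.

```python
def build_ctas(source, target, external_location, columns, partition_columns):
    """Build a CTAS statement that rewrites `source` as partitioned Parquet.

    Athena expects partition columns to come last in the SELECT list,
    so they are appended after the regular columns.
    """
    partition_list = ", ".join(f"'{name}'" for name in partition_columns)
    select_list = ", ".join(list(columns) + list(partition_columns))
    return (
        f"CREATE TABLE {target}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        f"  external_location = '{external_location}',\n"
        f"  partitioned_by = ARRAY[{partition_list}]\n"
        ") AS\n"
        f"SELECT {select_list}\n"
        f"FROM {source}"
    )


# Hypothetical source (raw JSON table) and target (Parquet table).
ctas_sql = build_ctas(
    source="logs_db.app_logs_json",
    target="logs_db.app_logs_parquet",
    external_location="s3://example-log-bucket/parquet/",
    columns=["request_id", "endpoint", "status"],
    partition_columns=["year", "month", "day"],
)
print(ctas_sql)
```

Because later queries that filter on year, month, and day read only the matching S3 prefixes, a one-time conversion like this is what makes the per-TB pricing cheap in practice.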

Gotchas & Constraints

Gotcha #1: Athena charges per TB scanned, so use Parquet/ORC and partitioning to minimize costs.
Gotcha #2: Athena has a default query timeout of 30 minutes and limits on result size; use CTAS to write large results to S3 as a table.
Constraints: a maximum of 200 concurrent DML queries per account in major regions (as low as 20 in smaller regions; you can request an increase), a maximum of 10,000 databases per catalog, and query results are stored in S3, so you also pay S3 storage costs.
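Because queries run asynchronously and can time out, callers usually poll for completion. The sketch below is one possible polling loop with a client-side deadline that defaults to the 30-minute figure above, and it also reports how many bytes the query scanned. The function name and defaults are illustrative; `client` is assumed to be a boto3 Athena client (or any object with the same two methods).

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(client, query_execution_id, timeout_s=1800.0, poll_s=2.0):
    """Poll until the query finishes; return (state, bytes_scanned).

    If the client-side deadline passes first, ask Athena to stop the
    query and raise TimeoutError.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        execution = client.get_query_execution(
            QueryExecutionId=query_execution_id
        )["QueryExecution"]
        state = execution["Status"]["State"]
        if state in TERMINAL_STATES:
            scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
            return state, scanned
        if time.monotonic() >= deadline:
            client.stop_query_execution(QueryExecutionId=query_execution_id)
            raise TimeoutError(f"query {query_execution_id} exceeded {timeout_s}s")
        time.sleep(poll_s)
```

Logging the returned byte count per query is a simple way to spot statements that scan far more data than expected before they show up on the bill.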

A SaaS company stores application logs in S3: 10TB/day of JSON files. They need to analyze the logs for troubleshooting and analytics. Loading the logs into Redshift would cost $10,000/month and require ETL pipelines. Instead, they use Athena: they create a Glue crawler to discover the log schema, partition the data by date (year/month/day), and convert the JSON to Parquet (roughly 10x compression and 10x faster queries). They run queries such as 'show all 500 errors in the last hour' or 'count requests by endpoint and status code.' Queries scan only the relevant partitions (about 10GB per query instead of 10TB), costing about $0.05 per query at Athena's standard rate of $5 per TB scanned. For dashboards, they use QuickSight connected to Athena. For alerts, they schedule Athena queries via Lambda and send SNS notifications if error rates exceed thresholds.
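One of the scenario's queries, 'count requests by endpoint and status code', might look like the following when restricted to a single day's partition. This is a sketch: the table name and the endpoint and status columns are hypothetical, and it assumes the year/month/day partition columns are zero-padded strings.

```python
from datetime import date


def build_daily_status_query(table, day):
    """Count requests by endpoint and status for one day's partition.

    Filtering on the partition columns lets Athena skip every other
    day's files instead of scanning the whole table.
    """
    return (
        "SELECT endpoint, status, COUNT(*) AS requests\n"
        f"FROM {table}\n"
        f"WHERE year = '{day:%Y}' AND month = '{day:%m}' AND day = '{day:%d}'\n"
        "GROUP BY endpoint, status\n"
        "ORDER BY requests DESC"
    )


# Hypothetical table name and date.
print(build_daily_status_query("logs_db.app_logs_parquet", date(2024, 1, 15)))
```

Selecting only the columns the report needs matters as much as the WHERE clause, since with Parquet Athena reads just the referenced columns.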

The Result

About $50/month in query costs (equivalent to roughly 1,000 queries at $0.05 each, vs. $10,000/month for Redshift), no infrastructure management, and fast insights from S3 data.
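The cost figures can be checked with a little arithmetic. The sketch below assumes a list price of $5 per TB scanned, decimal units (1 TB = 10^12 bytes), and a hypothetical volume of 1,000 queries per month. Actual prices vary by region, and billing rounding and per-query minimums are ignored.

```python
PRICE_PER_TB_USD = 5.00   # assumption: standard list price per TB scanned
BYTES_PER_TB = 10**12     # assumption: decimal terabytes, for simplicity


def query_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the cost of one query from the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


per_query = query_cost_usd(10 * 10**9)    # 10 GB scanned after partition pruning
monthly = per_query * 1000                # hypothetical 1,000 queries per month
full_scan = query_cost_usd(10 * 10**12)   # one query over 10 TB of raw JSON

print(f"per query: ${per_query:.2f}")   # per query: $0.05
print(f"monthly:   ${monthly:.2f}")     # monthly:   $50.00
print(f"full scan: ${full_scan:.2f}")   # full scan: $50.00
```

Under these assumptions, a single unpartitioned scan of one day's raw JSON costs as much as the whole month of partition-pruned queries.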
