Analytics & Business Intelligence

    Amazon Athena

    Query data in S3 using standard SQL without infrastructure

    Athena lets you run standard SQL directly against data that stays in S3, with no servers to manage. If you have terabytes of logs, CSV files, or Parquet files in S3, you can query them in place: no loading step, no ETL pipeline, no cluster to provision. You pay only for the data your queries scan. It's like having a librarian who can search through millions of books without moving them off the shelves. This makes Athena a good fit for ad-hoc analysis and interactive queries on data lakes.

    Athena is a serverless query service built on the open-source Presto and Trino engines. You define tables in the AWS Glue Data Catalog that point to S3 locations; Athena then reads the data from S3, executes your SQL, and returns the results. It supports multiple formats, including CSV, JSON, Parquet, ORC, and Avro.
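
    To make the table-definition step concrete, here is a minimal sketch that builds the DDL for an external table over JSON files in S3. The database, table, column, and bucket names are hypothetical, and the helper itself is only an illustration, not part of any AWS SDK; the statement it produces follows Athena's CREATE EXTERNAL TABLE syntax with the OpenX JSON SerDe.

```python
def external_table_ddl(database, table, columns, location, partitions=()):
    """Build a CREATE EXTERNAL TABLE statement for JSON data in S3.

    All names passed in are illustrative; the table only points at the
    S3 location, no data is copied or loaded.
    """
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n  {cols}\n)\n"
    if partitions:
        parts = ", ".join(f"{name} {typ}" for name, typ in partitions)
        ddl += f"PARTITIONED BY ({parts})\n"
    ddl += "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
    ddl += f"LOCATION '{location}'"
    return ddl


ddl = external_table_ddl(
    database="logs_db",                      # hypothetical database
    table="app_logs",                        # hypothetical table
    columns=[("request_id", "string"), ("endpoint", "string"), ("status", "int")],
    location="s3://example-bucket/logs/",    # hypothetical bucket
    partitions=[("year", "string"), ("month", "string"), ("day", "string")],
)
print(ddl)
```

    For a partitioned table, the partitions still have to be registered before queries return rows, for example with MSCK REPAIR TABLE when the S3 prefixes use the key=value layout.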

    Key Capabilities

    Partitioning reduces the data each query scans, compression and columnar formats reduce cost, CTAS (Create Table As Select) writes query results to a new table for lightweight ETL, and federated queries reach data outside S3, such as RDS, DynamoDB, and on-premises databases.
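
    As an illustration of CTAS used for ETL, the sketch below builds a statement that rewrites a source table as partitioned Parquet. The table names, column names, and S3 location are hypothetical, and the helper is just a string builder; the WITH properties shown (format, external_location, partitioned_by) are the ones Athena's CTAS syntax uses.

```python
def ctas_to_parquet_sql(source, target, external_location, columns, partition_columns):
    """Build a CTAS statement that writes `source` as partitioned Parquet.

    Athena expects the partition columns to come last in the SELECT list,
    so they are appended after the regular columns.
    """
    select_list = ", ".join(list(columns) + list(partition_columns))
    partitioned_by = ", ".join(f"'{c}'" for c in partition_columns)
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{external_location}',\n"
        f"  partitioned_by = ARRAY[{partitioned_by}]\n"
        f")\n"
        f"AS SELECT {select_list}\n"
        f"FROM {source}"
    )


ctas = ctas_to_parquet_sql(
    source="logs_db.app_logs",                        # hypothetical JSON table
    target="logs_db.app_logs_parquet",                # hypothetical Parquet table
    external_location="s3://example-bucket/logs-parquet/",
    columns=["request_id", "endpoint", "status"],
    partition_columns=["year", "month", "day"],
)
print(ctas)
```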

    Gotchas & Constraints

    Gotcha #1: Athena charges per TB of data scanned, so use Parquet or ORC and partition your data to minimize costs.

    Gotcha #2: Queries time out after 30 minutes by default, and result sizes are limited; use CTAS to write large results back to S3.

    Constraints: a maximum of 200 concurrent DML queries per account in major regions (as low as 20 in smaller regions; you can request an increase), a maximum of 10,000 databases per catalog, and query results are stored in S3, so you also pay S3 storage costs for them.
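
    The per-TB pricing can be turned into a quick back-of-the-envelope estimate. The sketch below assumes $5 per TB scanned (the rate implied by the $0.05-per-10GB figure in this section; actual prices vary by region), decimal units, and a 10 MB minimum charge per query. The function name and constants are illustrative.

```python
PRICE_PER_TB_USD = 5.0          # assumed rate; check current regional pricing
BYTES_PER_TB = 10**12           # decimal units, a simplifying assumption
MIN_BILLED_BYTES = 10 * 10**6   # assumed 10 MB minimum billed per query


def estimate_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the charge in USD for one query from the bytes it scans."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / BYTES_PER_TB * price_per_tb


pruned = estimate_query_cost(10 * 10**9)      # 10 GB: partition-pruned Parquet
full_day = estimate_query_cost(10 * 10**12)   # 10 TB: one full day of raw JSON
print(f"pruned: ${pruned:.2f}, full scan: ${full_day:.2f}")
```

    Under these assumptions, a single unpruned scan of 10TB costs as much as a thousand 10GB queries, which is why the file format and the partition layout matter more than the number of queries.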

    A SaaS company stores application logs in S3: 10TB/day of JSON files. They need to analyze the logs for troubleshooting and analytics. Loading the logs into Redshift would cost an estimated $10,000/month and require ETL pipelines. Instead, they use Athena: a Glue crawler discovers the log schema, the data is partitioned by date (year/month/day), and the JSON is converted to Parquet (about 10x smaller and 10x faster to query).

    They run queries such as 'show all 500 errors in the last hour' or 'count requests by endpoint and status code.' Because each query filters on the partition columns and reads only the Parquet columns it needs, it scans roughly 10GB instead of a full day's 10TB, which costs about $0.05 at $5 per TB scanned. For dashboards, they connect QuickSight to Athena. For alerts, a scheduled Lambda function runs Athena queries and sends SNS notifications when error rates exceed thresholds.
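
    The query-and-alert flow in this scenario could look roughly like the sketch below. It assumes a table partitioned by string columns year, month, and day; the table name, database, and S3 output location are hypothetical. The Athena client is passed in as a parameter, so in a Lambda function it would typically be boto3.client("athena"); the three request fields and the response shape follow the boto3 Athena API.

```python
import time
from datetime import date


def errors_for_day_sql(table, day, status=500):
    """Build a query that filters on the partition columns so only
    one day's partition is scanned."""
    return (
        f"SELECT endpoint, count(*) AS errors\n"
        f"FROM {table}\n"
        f"WHERE year = '{day:%Y}' AND month = '{day:%m}' AND day = '{day:%d}'\n"
        f"  AND status = {status}\n"
        f"GROUP BY endpoint\n"
        f"ORDER BY errors DESC"
    )


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start a query and poll until it reaches a terminal state.

    Returns (state, execution_id); results are written by Athena to
    `output_location` in S3.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    execution_id = started["QueryExecutionId"]
    while True:
        info = athena.get_query_execution(QueryExecutionId=execution_id)
        state = info["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state, execution_id
        time.sleep(poll_seconds)


# Hypothetical table name; prints the partition-pruned query text.
print(errors_for_day_sql("logs_db.app_logs_parquet", date(2024, 1, 15)))
```

    A scheduled Lambda handler would call run_query, read the rows with get_query_results, and publish to SNS when the error count crosses its threshold; that part is omitted here because the thresholds and topic are specific to each deployment.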

    The Result

    About $50/month in query costs (vs. roughly $10,000/month for Redshift), no infrastructure to manage, and interactive answers directly from S3 data.
