🗄️Database

Amazon Redshift

Fully managed petabyte-scale data warehouse service

Redshift is like a massive warehouse optimized for analyzing huge amounts of data. While transactional databases (such as those running on RDS) are great for workloads like 'insert this order' or 'update that user', Redshift is built for analytical queries, such as 'show me total sales by region for the last 5 years' or 'which products are most popular among customers aged 25-35?' It stores data in columns instead of rows, so an aggregation reads only the columns it references rather than every field of every record, which makes large scans much faster. Think of it as the difference between a retail store (RDS) where you handle individual transactions, and a data analysis center (Redshift) where you crunch numbers on millions of records at once.

Redshift is a columnar data warehouse that uses massively parallel processing (MPP) to distribute queries across multiple nodes. Data is stored in columns (not rows), compressed, and distributed across compute nodes using distribution keys.
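
To make the row-versus-column idea concrete, here is a minimal Python sketch. It is only a toy model of the layout, not how Redshift stores data internally; the sample records and field names are invented for illustration. It pivots row records into per-column lists and then answers a "total sales by region" question by touching only the two columns involved.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The sample data and field names are made up for illustration.

rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
    {"order_id": 4, "region": "APAC", "amount": 200.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_by_region(columns):
    """Aggregate using only the two columns the query needs."""
    totals = {}
    for region, amount in zip(columns["region"], columns["amount"]):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


columns = to_columns(rows)
totals = total_by_region(columns)
print(totals)  # {'EU': 170.0, 'US': 80.0, 'APAC': 200.0}
```

The aggregation never reads the order_id column. In a columnar store that column would simply not be fetched from disk, which is where much of the saving on wide tables comes from.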

Key Capabilities

Columnar MPP (massively parallel processing) storage: data is distributed across compute node slices and queries execute in parallel, suited for large-scale analytics aggregations
Redshift Spectrum: run SQL queries directly against data stored in S3 without loading it into Redshift, using external tables in the Glue Data Catalog
Concurrency Scaling: automatically adds transient cluster capacity to handle spikes in concurrent queries, then removes it when demand drops
Materialized views: precompute and store query results for frequently run aggregations, with automatic or manual refresh
AQUA (Advanced Query Accelerator): hardware-accelerated cache layer for RA3 node types that offloads data-intensive operations from compute nodes
Automatic Workload Management (WLM): classifies and prioritizes queries into queues to prevent long-running analytics jobs from starving short interactive queries
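
As a rough illustration of two of the capabilities above (Redshift Spectrum external schemas and materialized views), the following Python sketch only assembles the SQL text; it does not connect to a cluster. The helper functions are hypothetical, and every schema, database, view, and IAM role name is a placeholder assumption rather than something taken from a real environment.

```python
# Hypothetical helpers that only build SQL text; they do not connect to Redshift.
# All schema, database, role, and view names below are placeholders.
# Identifiers are interpolated directly, so pass trusted values only.


def external_schema_sql(schema, glue_database, iam_role_arn):
    """Build a CREATE EXTERNAL SCHEMA statement backed by the Glue Data Catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


def materialized_view_sql(view_name, select_sql, auto_refresh=True):
    """Build a CREATE MATERIALIZED VIEW statement with automatic refresh on or off."""
    refresh = "YES" if auto_refresh else "NO"
    return (
        f"CREATE MATERIALIZED VIEW {view_name}\n"
        f"AUTO REFRESH {refresh}\n"
        f"AS {select_sql};"
    )


def refresh_sql(view_name):
    """Build the statement for a manual refresh."""
    return f"REFRESH MATERIALIZED VIEW {view_name};"


spectrum_stmt = external_schema_sql(
    "lake", "sales_archive", "arn:aws:iam::123456789012:role/ExampleSpectrumRole"
)
mv_stmt = materialized_view_sql(
    "sales_by_region",
    "SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region",
)
print(spectrum_stmt)
print(mv_stmt)
print(refresh_sql("sales_by_region"))
```

The generated statements would be run through whatever SQL client is already in use. Tables registered in the Glue database then become queryable under the external schema without loading them into the cluster.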

Gotchas & Constraints

Gotcha #1: Redshift is optimized for analytical queries, not transactional workloads; don't use it as an OLTP database.
Gotcha #2: Poor distribution keys cause data skew (some nodes store more data than others), which kills performance because every query waits for the most heavily loaded node. Choose high-cardinality keys whose values are evenly distributed.
Constraints: Multi-AZ is supported for RA3 clusters (99.99% SLA), but DC2 clusters are Single-AZ only. Maintenance windows require planning. VACUUM operations (which reclaim space and re-sort data) can impact performance while they run.
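
The effect of key cardinality on skew can be sketched with a small simulation. This is an illustrative model only: it uses SHA-256 as a stand-in hash, and the slice count, row count, and key values are assumptions; Redshift's real hashing and slice layout are not modeled.

```python
import hashlib
from collections import Counter


def slice_for(key, num_slices):
    """Map a distribution-key value to a slice with a deterministic hash.

    SHA-256 is a stand-in; Redshift's internal hash function is not modeled here.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def skew_ratio(keys, num_slices):
    """Return rows on the fullest slice divided by the ideal even share."""
    counts = Counter(slice_for(k, num_slices) for k in keys)
    ideal = len(keys) / num_slices
    return max(counts.values()) / ideal


NUM_SLICES = 8     # assumed slice count for the toy cluster
NUM_ROWS = 10_000  # assumed table size

# Low cardinality: only three distinct values, so at most three slices get data.
low_card = [("EU", "US", "APAC")[i % 3] for i in range(NUM_ROWS)]
# High cardinality: one distinct value per row.
high_card = list(range(NUM_ROWS))

low_skew = skew_ratio(low_card, NUM_SLICES)
high_skew = skew_ratio(high_card, NUM_SLICES)
print(f"low-cardinality key skew:  {low_skew:.2f}")
print(f"high-cardinality key skew: {high_skew:.2f}")
```

A ratio near 1.0 means the fullest slice holds about its fair share of rows. With three distinct key values spread over eight slices, at least five slices stay empty, so the fullest slice holds several times its share and becomes the bottleneck for every scan.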

A retail company analyzes 10 years of sales data (50TB) to identify trends and optimize inventory. Querying this data in RDS for PostgreSQL takes hours and maxes out CPU. They migrate to Redshift, loading data from S3 using COPY commands. Redshift distributes data across 10 nodes, and queries that took 2 hours now complete in 30 seconds. They use Redshift Spectrum to query an additional 200TB of historical data in S3 without loading it, combining warehouse and data lake queries. For near-real-time analytics, they stream data from Kinesis Data Firehose into Redshift every 5 minutes. Business analysts use QuickSight to visualize data, running complex queries (joins across 5 tables, aggregations over billions of rows) that return in seconds. They enable Concurrency Scaling to handle 50 concurrent users during month-end reporting without performance degradation.
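
The S3 load step in this scenario could look roughly like the statement built below. This is a sketch under stated assumptions: the helper function is hypothetical, the table name, bucket path, and IAM role ARN are placeholders, and Parquet is assumed as the file format because the scenario does not say which format the company used.

```python
# Hypothetical helper that builds a Redshift COPY statement as text.
# Table, bucket, and role names are placeholders, not values from the scenario.
# Identifiers are interpolated directly, so pass trusted values only.


def copy_from_s3_sql(table, s3_prefix, iam_role_arn, file_format="PARQUET"):
    """Build a COPY statement that loads every file under an S3 prefix.

    Other formats (for example CSV or JSON) may need additional options
    that this sketch does not cover.
    """
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must start with s3://")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {file_format};"
    )


copy_stmt = copy_from_s3_sql(
    "sales",
    "s3://example-bucket/sales/2024/",
    "arn:aws:iam::123456789012:role/ExampleCopyRole",
)
print(copy_stmt)
```

Pointing COPY at a prefix that contains many files, rather than at one large file, lets the load be spread across the cluster's slices in parallel.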

The Result

Queries more than 100x faster (2 hours down to 30 seconds is a 240x speedup), $50,000/month cost savings vs. scaling RDS, and self-service analytics for business users.
