AWS Step Functions
Orchestration service for coordinating distributed applications
Step Functions is like the conductor of an orchestra: it coordinates multiple services so they work together in a specific sequence. Imagine a workflow: upload a file to S3, trigger a Lambda function to process it, store the results in DynamoDB, send a notification via SNS, and, if anything fails, retry or send an error alert. Step Functions orchestrates this entire workflow, handling retries, errors, and state management. You define the workflow as a state machine (rendered as a visual diagram in the console), and Step Functions executes it, tracking progress and handling failures. It is well suited to complex, multi-step processes that span multiple AWS services.
Step Functions executes state machines defined in Amazon States Language (JSON). States include Task (invoke Lambda, ECS, Batch, etc.), Choice (branching logic), Parallel (execute branches concurrently), Map (iterate over an array), Wait (delay), Pass (pass input through to output, optionally transforming it), and Succeed/Fail (terminal states). Step Functions offers two workflow types: Standard (long-running, exactly-once execution, full execution history) and Express (high-volume, short-lived, at-least-once execution when invoked asynchronously, with execution history available only through CloudWatch Logs).
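
To make the state types concrete, here is a minimal sketch of a state machine definition built as a Python dictionary and serialized to Amazon States Language JSON. The Lambda ARN, the state names, and the `$.status` field are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical Lambda ARN, used only for illustration.
PROCESS_FN_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-file"

definition = {
    "Comment": "Minimal sketch: Task -> Choice -> Succeed or Fail",
    "StartAt": "ProcessFile",
    "States": {
        # Task: do the work by invoking a Lambda function.
        "ProcessFile": {
            "Type": "Task",
            "Resource": PROCESS_FN_ARN,
            "Next": "CheckResult",
        },
        # Choice: branch on a field of the task output (assumed to exist).
        "CheckResult": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "ok", "Next": "Done"}
            ],
            "Default": "Rejected",
        },
        # Terminal states.
        "Done": {"Type": "Succeed"},
        "Rejected": {
            "Type": "Fail",
            "Error": "ProcessingRejected",
            "Cause": "status was not ok",
        },
    },
}

asl_json = json.dumps(definition, indent=2)
print(asl_json)
```

The printed JSON is what would be supplied as the state machine definition, for example by pasting it into the console's code editor.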
Key Capabilities
- Orchestrates workflows as state machines defined in Amazon States Language (JSON), with states for Task, Choice, Parallel, Map, Wait, Pass, Succeed, and Fail
- Task states integrate directly with Lambda, ECS, SNS, SQS, DynamoDB, and more than 200 AWS services through AWS SDK integrations, without requiring a Lambda wrapper for every step
- Standard workflows provide exactly-once execution, durable state, auditable history per step (with full input/output), and execution durations up to one year
- Express workflows support high-throughput, short-duration workloads (up to 5 minutes) at significantly lower cost, billed by number of executions, execution duration, and memory consumed rather than per state transition
- Each state supports Retry and Catch error handling, allowing granular retry logic with exponential backoff and fallback states per error type
- Workflow Studio provides a visual drag-and-drop builder for constructing and visualizing state machines without writing States Language JSON by hand
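
As a rough illustration of the direct service integrations and the per-state Retry and Catch handling listed above, the sketch below defines one Task state that writes to DynamoDB without a Lambda wrapper. The table name, the `$.id` input field, and the `NotifySuccess` / `NotifyFailure` state names are assumptions. The helper `retry_delays` is not part of any AWS API; it only computes the nominal waits implied by `IntervalSeconds` and `BackoffRate`, ignoring any jitter or maximum-delay setting.

```python
import json

def retry_delays(interval_seconds, backoff_rate, max_attempts):
    """Nominal wait before each retry: interval * backoff_rate ** attempt_index."""
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]

save_result_state = {
    "Type": "Task",
    # Direct service integration: no Lambda function in between.
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "Results",  # hypothetical table name
        "Item": {
            "id": {"S.$": "$.id"},  # assumes the state input has an "id" field
            "status": {"S": "processed"},
        },
    },
    # Retry failed attempts with exponential backoff.
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    # If retries are exhausted, keep the error details and go to a fallback state.
    "Catch": [
        {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "NotifyFailure"}
    ],
    "Next": "NotifySuccess",
}

retry = save_result_state["Retry"][0]
print(retry_delays(retry["IntervalSeconds"], retry["BackoffRate"], retry["MaxAttempts"]))
# prints [2.0, 4.0, 8.0]
print(json.dumps(save_result_state, indent=2))
```

With these settings the state waits roughly 2, 4, and 8 seconds before its three retries and only then falls through to the Catch target.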
Gotchas & Constraints
- Gotcha #1: Standard workflows are charged per state transition, so complex workflows with many states (or many retries) can become expensive.
- Gotcha #2: Standard workflows have a 1-year maximum execution time; Express workflows are limited to 5 minutes.
- Constraints: maximum 25,000 events in execution history (Standard), maximum 256 KB input/output per state, and maximum 25 parallel branches.
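
The per-transition charge and the payload limit lend themselves to a quick back-of-the-envelope check. The sketch below is only that: the price per transition is an assumed figure passed in by the caller (check the current pricing page for your region), and the payload check approximates the size by serializing to JSON, which may differ slightly from how the service measures it.

```python
import json

MAX_PAYLOAD_BYTES = 256 * 1024  # 256 KB input/output limit per state

def payload_fits(payload):
    """Approximate check: size of the JSON-serialized payload in UTF-8 bytes."""
    return len(json.dumps(payload).encode("utf-8")) <= MAX_PAYLOAD_BYTES

def transition_cost(executions, transitions_per_execution, price_per_transition):
    """Estimated Standard workflow cost, ignoring any free tier."""
    return executions * transitions_per_execution * price_per_transition

# Assumed price, for illustration only: 0.000025 USD per state transition.
ASSUMED_PRICE = 0.000025

monthly = transition_cost(1_000_000, 10, ASSUMED_PRICE)
print(f"Estimated monthly cost: ${monthly:.2f}")  # prints Estimated monthly cost: $250.00

print(payload_fits({"bucket": "videos", "key": "input/clip.mp4"}))  # prints True
print(payload_fits({"blob": "x" * 300_000}))                        # prints False
```

A payload that does not fit is usually handled by storing the data in S3 and passing only the bucket and key between states.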
Consider a video processing pipeline: a user uploads a video to S3, and the system transcodes it to multiple resolutions, generates thumbnails, extracts metadata, updates a database, and sends a notification. Previously, the team chained Lambda functions with SQS, which made error handling complex and progress hard to track, and the 15-minute Lambda timeout was a problem for large videos.
With Step Functions, the S3 upload triggers a state machine. The first state invokes a Lambda function to validate the video. The second state is a Parallel state with two branches: one invokes MediaConvert for transcoding (about 30 minutes), the other invokes a Lambda function for thumbnail generation. The Parallel state completes only when both branches have finished; the next state then invokes a Lambda function to extract metadata and update DynamoDB. The final state sends an SNS notification. If transcoding fails, Step Functions retries it 3 times with exponential backoff. If it still fails, the workflow sends an error notification and is marked as failed. The team monitors all executions in the Step Functions console, where they can see which step failed, view the input and output of each state, and redrive failed executions from the point of failure.
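
The pipeline described above could be sketched roughly as the following state machine definition. This is an illustration under stated assumptions, not the team's actual definition: the function names, role and topic ARNs, and the `$.jobSettings` input field are hypothetical, and the MediaConvert resource string assumes the optimized `createJob.sync` integration, which makes the state wait until the job finishes. The helpers `lambda_task` and `missing_targets` are local conveniences, not AWS APIs.

```python
import json

# Hypothetical ARNs, for illustration only.
ROLE_ARN = "arn:aws:iam::123456789012:role/mediaconvert-role"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:video-events"

def lambda_task(function_name, next_state=None):
    """Build a Task state that invokes a Lambda function (names are hypothetical)."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
        "OutputPath": "$.Payload",  # pass on only the function's return value
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state

def sns_publish(message, **transition):
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {"TopicArn": TOPIC_ARN, "Message": message},
        **transition,
    }

transcode = {
    "Type": "Task",
    # Assumed optimized integration; .sync waits for the job to complete.
    "Resource": "arn:aws:states:::mediaconvert:createJob.sync",
    "Parameters": {"Role": ROLE_ARN, "Settings.$": "$.jobSettings"},
    # Retry 3 times with exponential backoff, as in the scenario.
    "Retry": [{"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 10,
               "MaxAttempts": 3, "BackoffRate": 2.0}],
    "End": True,
}

definition = {
    "StartAt": "ValidateVideo",
    "States": {
        "ValidateVideo": lambda_task("validate-video", "ProcessVideo"),
        # Parallel finishes only when both branches have finished.
        "ProcessVideo": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "Transcode", "States": {"Transcode": transcode}},
                {"StartAt": "Thumbnails",
                 "States": {"Thumbnails": lambda_task("generate-thumbnails")}},
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error",
                       "Next": "NotifyFailure"}],
            "Next": "ExtractMetadata",
        },
        # This function is assumed to also write the metadata to DynamoDB.
        "ExtractMetadata": lambda_task("extract-metadata", "NotifySuccess"),
        "NotifySuccess": sns_publish("Video processed", End=True),
        "NotifyFailure": sns_publish("Video processing failed", Next="MarkFailed"),
        "MarkFailed": {"Type": "Fail", "Error": "VideoProcessingFailed"},
    },
}

def missing_targets(states):
    """Return Next/Catch targets that do not name a state at this level."""
    targets = {s["Next"] for s in states.values() if "Next" in s}
    targets |= {c["Next"] for s in states.values() for c in s.get("Catch", [])}
    return targets - set(states)

print(missing_targets(definition["States"]))  # prints set()
print(json.dumps(definition, indent=2))
```

Putting the Catch on the Parallel state rather than on each branch means a failure in either branch, after its own retries are exhausted, routes to the same error notification.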
The Result
Reliable video processing, straightforward error handling, and full visibility into workflow execution.