🧠AI & Machine Learning

Amazon Polly

Convert text to lifelike speech with neural text-to-speech

Polly is like having a professional voice actor who can read any text aloud in dozens of languages and voices. You give it text, and it returns natural-sounding audio. It supports multiple languages, accents, and speaking styles, which makes it a good fit for audiobooks, voice assistants, accessibility features, and automated phone systems. Think of it as giving your application a voice: instead of reading text, users can listen to it.

Polly uses deep learning to synthesize speech from text. You provide text (plain text, or SSML for pronunciation control), choose a voice (60+ voices in 30+ languages), and Polly returns audio (MP3, Ogg Vorbis, or PCM). Polly supports four engines: Standard (faster, lower cost), Neural (more natural, higher quality), Long-Form (optimized for long content like articles), and Generative (highest quality, most expressive).
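The request flow described above (text in, voice and engine chosen, audio bytes out) can be sketched in Python. This is a minimal sketch, not a definitive implementation: `build_synthesis_request` and `synthesize` are hypothetical helper names, and the client is passed in so the code can be exercised without AWS access. The `synthesize_speech` call and its `Text`, `TextType`, `VoiceId`, `Engine`, and `OutputFormat` parameters are those of the boto3 Polly client.

```python
from typing import Any, Dict

# Engines and formats named in the text, spelled the way the Polly API expects.
ENGINES = {"standard", "neural", "long-form", "generative"}
OUTPUT_FORMATS = {"mp3", "ogg_vorbis", "pcm"}


def build_synthesis_request(text: str, voice_id: str = "Joanna",
                            engine: str = "neural",
                            output_format: str = "mp3",
                            ssml: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for a SynthesizeSpeech call (hypothetical helper)."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    return {
        "Text": text,
        "TextType": "ssml" if ssml else "text",
        "VoiceId": voice_id,
        "Engine": engine,
        "OutputFormat": output_format,
    }


def synthesize(client: Any, **kwargs: Any) -> bytes:
    """Call Polly through an injected client and return the raw audio bytes."""
    response = client.synthesize_speech(**build_synthesis_request(**kwargs))
    # AudioStream is a streaming body; read() returns the encoded audio.
    return response["AudioStream"].read()


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   audio = synthesize(boto3.client("polly"), text="Hello from Polly.")
#   open("hello.mp3", "wb").write(audio)
```

Passing the client in rather than creating it inside the function keeps the request-building logic easy to test with a stub.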

Key Capabilities

Converts text to speech using neural TTS (NTTS) and standard TTS voices across 30+ languages; neural voices produce significantly more natural-sounding output
Output formats include MP3, Ogg Vorbis, and PCM (raw audio), with configurable sample rates for different delivery targets
SSML (Speech Synthesis Markup Language) controls pronunciation, emphasis, pauses, speaking rate, pitch, and whisper effects inline within the text; not every tag is supported by every engine (the whisper effect, for example, is limited to standard voices)
Custom lexicons define pronunciations for domain-specific words (brand names, medical terms, acronyms) that the default models handle incorrectly
Speech marks return word timing and viseme metadata alongside audio, enabling lip-sync animation and karaoke-style text highlighting
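Asynchronous synthesis (the StartSpeechSynthesisTask API; up to 100,000 billed characters, output written to S3) handles full articles or documents and is separate from the Long-Form engine; synchronous synthesis (SynthesizeSpeech) is limited to 3,000 billed characters per request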
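To make the speech-marks capability concrete, here is a small sketch of how the metadata might be consumed for karaoke-style highlighting. Polly returns speech marks as newline-delimited JSON when a request sets the output format to `json` and lists the desired `SpeechMarkTypes`. The function names and the sample timings below are illustrative assumptions, not real output.

```python
import json
from typing import Dict, List, Tuple


def parse_speech_marks(payload: str) -> List[Dict]:
    """Parse newline-delimited JSON speech marks into a list of dicts."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def word_highlights(marks: List[Dict]) -> List[Tuple[int, str]]:
    """Return (time_ms, word) pairs, skipping visemes and other mark types."""
    return [(m["time"], m["value"]) for m in marks if m["type"] == "word"]


# Sample payload in the shape Polly uses for word and viseme marks.
# The timing values are made up for illustration.
SAMPLE = "\n".join([
    '{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}',
    '{"time":6,"type":"viseme","value":"k"}',
    '{"time":370,"type":"word","start":6,"end":11,"value":"world"}',
])

print(word_highlights(parse_speech_marks(SAMPLE)))
# prints [(6, 'Hello'), (370, 'world')]
```

A player can then highlight each word when the audio position passes its `time` value; viseme marks can drive mouth shapes for lip-sync in the same way.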

Gotchas & Constraints

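Gotcha #1: Neural voices cost about 4x more than standard voices; consider standard voices for high-volume applications where naturalness matters less.
Gotcha #2: Synchronous requests are capped at 3,000 billed characters regardless of engine. SSML tags are not billed, so an SSML request may contain up to 6,000 characters in total. Split longer text into chunks or use an asynchronous task.
Constraints: Maximum 3,000 billed characters per synchronous request (6,000 total characters including SSML tags). The free tier is metered in characters per month and the allowance differs by engine; 100,000 characters per month is the smallest allowance, and standard and neural voices get more.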
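The chunking advice above can be sketched as follows. This is one possible approach under stated assumptions: `split_for_polly` is a hypothetical helper, it handles plain text only (SSML would need its tags kept balanced within each chunk), and it uses Python's `len()` as an approximation of Polly's character count.

```python
import re
from typing import List

MAX_CHARS = 3000  # per-request limit for plain text cited above


def split_for_polly(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped,
        # which may cut a word in half.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio segments concatenated in order. Breaking at sentence boundaries avoids audible cuts in the middle of a phrase.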

An e-learning platform wants to add audio narration to its courses. Hiring voice actors for 1,000 courses would cost $500,000. Instead, they use Polly: for each course module, they send the text to Polly with a neural voice (Joanna for English, Lupe for Spanish). Polly generates audio files, which they store in S3 and serve via CloudFront. They use SSML to control the pronunciation of technical terms and to add pauses for emphasis. For accessibility, they add a 'listen' button to every article; clicking it has Polly read the article aloud in real time. They support 10 languages, using a different Polly voice for each. They process 10 million characters/month, costing $400/month (vs. $500,000 for voice actors).

The Result

Accessible content, multilingual support, and 99% cost savings.
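The course-narration step in this scenario could look roughly like the sketch below, which starts an asynchronous synthesis task that writes an MP3 file to S3. The helper name `start_narration_task`, the voice map, and the bucket and prefix values are assumptions for illustration. `start_speech_synthesis_task` and its parameters are from the boto3 Polly client.

```python
from typing import Any

# Voices named in the scenario, keyed by the language code each voice speaks.
VOICES = {"en-US": "Joanna", "es-US": "Lupe"}


def start_narration_task(client: Any, text: str, language: str,
                         bucket: str, prefix: str) -> str:
    """Start an asynchronous task that writes narration audio to S3; return its task ID."""
    response = client.start_speech_synthesis_task(
        Text=text,
        VoiceId=VOICES[language],
        Engine="neural",
        OutputFormat="mp3",
        OutputS3BucketName=bucket,   # hypothetical bucket name supplied by the caller
        OutputS3KeyPrefix=prefix,    # e.g. one prefix per course module
    )
    return response["SynthesisTask"]["TaskId"]


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   task_id = start_narration_task(boto3.client("polly"), module_text,
#                                  "es-US", "my-course-audio", "module-01/")
```

The returned task ID can be used to poll for completion before the file is served through CloudFront. The real-time 'listen' button would more likely use the synchronous call, since it needs audio back immediately.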
