AI & Machine Learning

    Amazon Polly

    Convert text to lifelike speech with neural text-to-speech

    Polly is like having a professional voice actor who can read any text aloud in dozens of languages and voices. You give it text, and it returns natural-sounding, almost human speech as audio. It supports multiple languages, accents, and speaking styles, which makes it a good fit for audiobooks, voice assistants, accessibility features, and automated phone systems. Think of it as giving your application a voice: instead of reading text, users can listen to it.

    Polly uses deep learning to synthesize speech from text. You provide text (plain text, or SSML when you need control over pronunciation), choose a voice (60+ voices in 30+ languages), and Polly returns audio as MP3, Ogg Vorbis, or PCM. Polly supports multiple engines: Standard (faster, lower cost), Neural (more natural, higher quality), Long-Form (optimized for long content like articles), and Generative (highest quality, most expressive). Not every voice is available on every engine.
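The request flow described above can be sketched with boto3's `synthesize_speech` call. This is a minimal illustration, not a production client: the helper names `build_request` and `synthesize_to_file` are hypothetical, and the network call assumes boto3 is installed and AWS credentials and a region are configured.

```python
from typing import Any, Dict

# Engine names accepted by the SynthesizeSpeech API.
ENGINES = {"standard", "neural", "long-form", "generative"}


def build_request(text: str, voice_id: str = "Joanna",
                  engine: str = "neural", ssml: bool = False) -> Dict[str, Any]:
    """Assemble keyword arguments for polly.synthesize_speech (hypothetical helper)."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    return {
        "Text": text,
        "TextType": "ssml" if ssml else "text",
        "VoiceId": voice_id,
        "Engine": engine,
        "OutputFormat": "mp3",  # "ogg_vorbis" and "pcm" are also accepted
    }


def synthesize_to_file(text: str, path: str, **options: Any) -> None:
    """Call Polly and write the returned audio stream to `path`.

    Assumes boto3 is installed and AWS credentials/region are configured.
    """
    import boto3  # imported lazily so build_request works without the SDK

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_request(text, **options))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

A call such as `synthesize_to_file("Welcome to the course.", "intro.mp3", voice_id="Joanna")` would write an MP3 file, provided the chosen voice is available on the chosen engine.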

    Key Capabilities

    Speech marks (timing metadata for words and visemes, useful for lip-syncing), lexicons (custom pronunciations), and SSML tags (control over pitch, rate, volume, and pauses).
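As a rough illustration of two of these capabilities, the sketch below builds a small SSML document with a pause after each sentence and prepares a speech-mark request. The helper names `to_ssml` and `speech_mark_request` are hypothetical; the parameter names (`TextType`, `OutputFormat`, `SpeechMarkTypes`) are those of the SynthesizeSpeech API, and the returned dictionary would be passed to `synthesize_speech` as keyword arguments.

```python
from xml.sax.saxutils import escape


def to_ssml(sentences, pause_ms=400, rate="medium"):
    """Wrap sentences in SSML, adding a fixed pause after each one."""
    pause = '<break time="{}ms"/>'.format(pause_ms)
    # escape() protects characters such as & and < that would break the XML.
    body = "".join(escape(sentence) + pause for sentence in sentences)
    return '<speak><prosody rate="{}">{}</prosody></speak>'.format(rate, body)


def speech_mark_request(ssml, voice_id="Joanna"):
    """Keyword arguments asking Polly for word and viseme timing metadata."""
    return {
        "Text": ssml,
        "TextType": "ssml",
        "VoiceId": voice_id,
        "OutputFormat": "json",  # speech marks are returned as JSON lines
        "SpeechMarkTypes": ["word", "viseme"],
    }
```

Speech marks and audio come from separate requests: one with `OutputFormat` set to `json` for the timing data, and one with an audio format for the sound itself.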

    Gotchas & Constraints

    Gotcha #1: Neural voices cost about 4x more per character than standard voices; consider standard voices for high-volume applications where top quality is not essential. Gotcha #2: Polly limits each synthesis request to 3,000 billed characters (6,000 total characters once SSML tags, which are not billed, are included), regardless of engine; split long text into chunks. Constraints: the per-request limits above, plus a monthly free-tier allowance that differs by engine (for example, 100,000 characters per month for Generative voices); check the AWS pricing page for current figures.
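One way to work within the per-request limit is to split text at sentence boundaries before sending it. The sketch below is an assumption-laden illustration: `chunk_text` is a hypothetical helper, it counts Python characters rather than Polly's billed characters, and its sentence detection is a simple punctuation heuristic.

```python
import re

MAX_CHARS = 3000  # per-request limit quoted above for plain text


def chunk_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself over the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)  # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting audio segments played or concatenated in order.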

    An e-learning platform wants to add audio narration to its courses. Hiring voice actors for 1,000 courses would cost $500,000. Instead, the platform uses Polly: for each course module, it sends the text to Polly with a neural voice (Joanna for English, Lupe for Spanish). Polly generates audio files, which the platform stores in S3 and serves via CloudFront. SSML controls the pronunciation of technical terms and adds pauses for emphasis. For accessibility, every article gets a 'listen' button; clicking it triggers Polly to read the article aloud in real time. The platform supports 10 languages, using a different Polly voice for each. It processes 10 million characters/month at a cost of $400/month (vs. $500,000 for voice actors).
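The cost side of this scenario can be checked with simple arithmetic. The per-million-character prices below are assumptions based on commonly cited list prices, not figures from this scenario, so verify them against current AWS pricing. Under these assumptions the synthesis charge alone comes out below the $400/month quoted, which suggests that figure may include storage, delivery, or different rates.

```python
# Assumed list prices in USD per one million characters; check current AWS pricing.
PRICE_PER_MILLION = {"standard": 4.00, "neural": 16.00}


def monthly_cost(characters, engine):
    """Estimated synthesis cost for one month's character volume."""
    return characters / 1_000_000 * PRICE_PER_MILLION[engine]


# Estimate the scenario's 10 million characters/month on each engine.
for engine in PRICE_PER_MILLION:
    print(f"{engine}: ${monthly_cost(10_000_000, engine):,.2f}")
```

The ratio between the two assumed prices is the 4x difference between neural and standard voices noted earlier.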

    The Result

    Accessible content, multilingual support, and 99% cost savings.
