
Amazon Transcribe

Automatic speech recognition service to convert audio to text

Transcribe is like having a stenographer who listens to audio and types out every word. You give it an audio file (or stream audio in real time), and it returns a text transcript. It can tell multiple speakers apart, tolerates a reasonable amount of background noise, and can be tuned for technical jargon. That makes it a good fit for meeting transcription, video subtitles, call center analytics, and voice-controlled applications. Think of it as giving your application the ability to hear spoken language and write it down.

Under the hood, Transcribe uses deep learning models for automatic speech recognition (ASR). You provide audio (MP3, WAV, FLAC, and other formats) either as a file in S3 or as a live stream, and Transcribe returns the text along with timestamps.
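To make "text with timestamps" concrete, here is a minimal sketch that reads a batch transcript. The sample JSON is hand-written for illustration and trimmed to the fields used here; the helper name `timed_words` is hypothetical, and a real transcript file contains more fields than shown.

```python
import json

# Hand-written sample in the general shape of a Transcribe batch result.
# Real output has additional fields; this is trimmed for illustration.
SAMPLE = """
{
  "jobName": "example-job",
  "results": {
    "transcripts": [{"transcript": "Hello world."}],
    "items": [
      {"type": "pronunciation", "start_time": "0.04", "end_time": "0.41",
       "alternatives": [{"confidence": "0.99", "content": "Hello"}]},
      {"type": "pronunciation", "start_time": "0.41", "end_time": "0.90",
       "alternatives": [{"confidence": "0.98", "content": "world"}]},
      {"type": "punctuation",
       "alternatives": [{"confidence": "0.0", "content": "."}]}
    ]
  }
}
"""


def timed_words(result: dict) -> list[tuple[float, float, str]]:
    """Return (start, end, word) for each spoken word.

    Punctuation items carry no timestamps, so they are skipped.
    """
    words = []
    for item in result["results"]["items"]:
        if item.get("type") != "pronunciation":
            continue
        best = item["alternatives"][0]["content"]
        words.append((float(item["start_time"]), float(item["end_time"]), best))
    return words


result = json.loads(SAMPLE)
full_text = result["results"]["transcripts"][0]["transcript"]
words = timed_words(result)
```

Word-level start and end times like these are what subtitle generation and "jump to this moment in the recording" features are typically built on.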

Key Capabilities

Converts speech to text using deep learning ASR for both batch transcription (audio files in S3) and real-time streaming transcription (WebSocket or HTTP/2)
Speaker diarization identifies and labels distinct speakers in a recording, useful for call center transcripts and meeting recordings
Custom vocabulary improves accuracy for domain-specific terms (product names, medical terminology, jargon) not well-covered by the base model
Custom language models fine-tune ASR on a corpus of domain-specific text for sustained accuracy improvements in specialized fields
Automatic content redaction detects and masks PII (names, phone numbers, SSNs) in transcripts without a separate post-processing step
Transcribe Medical handles clinical audio with a model trained on medical terminology for use in healthcare documentation workflows
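Several of the capabilities above are switched on through parameters of a batch transcription request. The following is a rough sketch of how such a request might be assembled for boto3's `start_transcription_job`; the function `build_job_request`, the bucket, and the job name are hypothetical, and the sketch covers only a subset of the available options.

```python
from typing import Optional


def build_job_request(
    job_name: str,
    media_uri: str,
    *,
    language_code: str = "en-US",
    max_speakers: Optional[int] = None,
    vocabulary_name: Optional[str] = None,
    language_model_name: Optional[str] = None,
    redact_pii: bool = False,
) -> dict:
    """Assemble keyword arguments for start_transcription_job (hypothetical helper)."""
    request = {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},
    }

    settings = {}
    if max_speakers is not None:
        # Speaker diarization: label who said what.
        settings["ShowSpeakerLabels"] = True
        settings["MaxSpeakerLabels"] = max_speakers
    if vocabulary_name:
        # Custom vocabulary for domain-specific terms.
        settings["VocabularyName"] = vocabulary_name
    if settings:
        request["Settings"] = settings

    if language_model_name:
        # Custom language model trained on domain text.
        request["ModelSettings"] = {"LanguageModelName": language_model_name}

    if redact_pii:
        # Automatic content redaction; only the redacted transcript is kept.
        request["ContentRedaction"] = {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
        }
    return request


# Hypothetical names, for illustration only.
request = build_job_request(
    "call-001",
    "s3://example-bucket/calls/call-001.wav",
    max_speakers=2,
    redact_pii=True,
)
# With credentials configured, the request could then be submitted with:
#   boto3.client("transcribe").start_transcription_job(**request)
```

Keeping request construction separate from the API call makes the options easy to inspect and unit test without touching AWS.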

Gotchas & Constraints

Gotcha #1: Accuracy varies by audio quality; clear audio with minimal background noise has highest accuracy.
Gotcha #2: Transcribe charges per second of audio, and costs can add up for long recordings.
Constraints: Maximum 4 hours per audio file (batch), maximum 4 hours per stream (streaming), and maximum 2 GB file size.
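The limits and per-second billing above lend themselves to a quick pre-flight check before submitting a job. This is a sketch under stated assumptions: the helper names are hypothetical, "2 GB" is interpreted as 2 GiB, and the per-minute rate is a placeholder derived from the call center figures in this document ($12,000 for 10,000 hours), not a published price.

```python
from decimal import Decimal

MAX_SECONDS = 4 * 60 * 60        # 4 hours: batch file or single stream
MAX_FILE_BYTES = 2 * 1024 ** 3   # 2 GB, interpreted here as 2 GiB (assumption)


def limit_violations(duration_seconds: int, size_bytes: int) -> list[str]:
    """Return a human-readable list of limits this audio would exceed."""
    problems = []
    if duration_seconds > MAX_SECONDS:
        problems.append(f"duration {duration_seconds}s exceeds {MAX_SECONDS}s")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"size {size_bytes}B exceeds {MAX_FILE_BYTES}B")
    return problems


def estimate_cost(duration_seconds: int, rate_per_minute: Decimal) -> Decimal:
    """Cost for audio billed by the second at the given per-minute rate."""
    return Decimal(duration_seconds) / Decimal(60) * rate_per_minute


# Placeholder rate, not a quoted AWS price: $12,000 / (10,000 h * 60 min).
HYPOTHETICAL_RATE = Decimal("0.02")

ok = limit_violations(3 * 3600, 500 * 1024 ** 2)       # within both limits
too_big = limit_violations(5 * 3600, 3 * 1024 ** 3)    # breaks both limits
monthly = estimate_cost(10_000 * 3600, HYPOTHETICAL_RATE)
```

Recordings that exceed a limit would need to be split or compressed before submission; actual rates vary by region, feature, and volume tier, so the current pricing page is the place to check before budgeting.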

A call center records 10,000 customer calls daily for quality assurance. Transcribing that volume by hand is impractical, so they use Transcribe: when a call ends, they upload the audio to S3, which triggers a Lambda function that starts a Transcribe job. Transcribe identifies speakers (agent vs. customer), transcribes the conversation, and redacts PII (credit card numbers, SSNs). They store the transcripts in DynamoDB and use Comprehend to analyze sentiment, identify frustrated customers, and flag calls for review. For compliance, they search transcripts for specific phrases ('cancel my account', 'speak to a manager'). For training, they identify calls where agents didn't follow scripts. They process 10,000 hours of audio/month, costing $12,000/month (vs. $100,000 for manual transcription).
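The upload-then-transcribe step of this scenario could look roughly like the Lambda sketch below. It assumes the function is triggered by an S3 object-created event and that the audio is in US English; `start_jobs` and `job_name_for` are hypothetical helpers, and error handling, retries, and unique-name collisions are left out.

```python
import re
import urllib.parse


def job_name_for(key: str) -> str:
    """Derive a job name from an S3 key, keeping only letters, digits, '.', '_', '-'."""
    return re.sub(r"[^0-9a-zA-Z._-]", "-", key)[:200]


def start_jobs(event: dict, transcribe) -> list[str]:
    """Start one transcription job per S3 record; returns the job names.

    `transcribe` is any object exposing start_transcription_job, such as
    boto3.client("transcribe").
    """
    started = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        name = job_name_for(key)
        transcribe.start_transcription_job(
            TranscriptionJobName=name,
            LanguageCode="en-US",  # assumption: calls are in US English
            Media={"MediaFileUri": f"s3://{bucket}/{key}"},
            # Two speakers expected: agent and customer.
            Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 2},
            ContentRedaction={"RedactionType": "PII", "RedactionOutput": "redacted"},
        )
        started.append(name)
    return started


def lambda_handler(event, context):
    # Imported here so the helpers above can be tested without boto3 installed.
    import boto3

    return {"started": start_jobs(event, boto3.client("transcribe"))}
```

Passing the client in as an argument keeps the core logic testable with a stub. Storing results in DynamoDB and running Comprehend would happen in a later step, once the job has completed.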

The Result

100% call transcription, automated quality assurance, and compliance monitoring.
