Amazon Comprehend
Natural language processing service for text analysis and insights
Comprehend is like having a linguist who can read and understand text at scale. You give it documents (customer reviews, support tickets, social media posts) and it extracts insights: sentiment (positive/negative), entities (people, places, organizations), key phrases, language, and topics. It's pre-trained on massive text datasets, so you don't need NLP expertise. Perfect for analyzing customer feedback, categorizing documents, or extracting information from unstructured text. Think of it as giving your application the ability to read and understand human language.
Comprehend provides APIs for text analysis: detect sentiment (positive, negative, neutral, mixed), extract entities (person, location, organization, date, etc.), detect key phrases, identify language, and analyze syntax (parts of speech). Comprehend also supports custom entity recognition, letting you train models to detect your specific entities (product names, internal codes).
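As a rough illustration of the APIs described above, the sketch below wraps four of the pre-trained calls from boto3's Comprehend client. The helper name `analyze_text` and the sample region in the comment are assumptions made for illustration. The client is passed in as an argument so the function can be exercised with a stub instead of real AWS credentials.

```python
def analyze_text(client, text, language_code="en"):
    """Run four pre-trained Comprehend analyses on one short document.

    `client` is expected to behave like a boto3 Comprehend client, e.g.
        import boto3
        client = boto3.client("comprehend", region_name="us-east-1")
    (the region is only an example).
    """
    sentiment = client.detect_sentiment(Text=text, LanguageCode=language_code)
    entities = client.detect_entities(Text=text, LanguageCode=language_code)
    phrases = client.detect_key_phrases(Text=text, LanguageCode=language_code)
    # Language detection takes no LanguageCode: finding it is the whole point.
    languages = client.detect_dominant_language(Text=text)

    return {
        # One of POSITIVE, NEGATIVE, NEUTRAL, MIXED
        "sentiment": sentiment["Sentiment"],
        # (entity type, matched text) pairs such as ("PERSON", "Jane Doe")
        "entities": [(e["Type"], e["Text"]) for e in entities["Entities"]],
        "key_phrases": [p["Text"] for p in phrases["KeyPhrases"]],
        "languages": [lang["LanguageCode"] for lang in languages["Languages"]],
    }
```

Each response also carries confidence scores (for example `SentimentScore` and a per-entity `Score`), which are dropped here for brevity but are usually worth keeping when results feed a dashboard or a routing rule.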
Key Capabilities
- Pre-trained APIs cover sentiment detection, entity recognition (people, places, organizations, dates, quantities), key phrase extraction, language detection (102 languages), syntax analysis, and PII identification
- Comprehend Medical is a separate variant trained on clinical text for extracting medical conditions, medications, dosages, and procedures from unstructured health records
- Custom Classification trains a document classifier on your own labeled categories for domain-specific routing or tagging tasks
- Custom Entity Recognition trains a named entity recognition (NER) model on domain-specific entity types not covered by the pre-trained general models
- Async batch processing submits large document sets stored in S3 as a single job, returning results to S3 without managing individual API calls per document
- Flywheel automates continuous model retraining as new labeled data arrives, keeping custom models up to date without manual retraining pipelines
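To make the async batch capability more concrete, here is a minimal sketch of submitting a sentiment detection job over documents in S3 with boto3. The helper names and every S3 URI, role ARN, and job name passed to them are hypothetical placeholders. The client is passed in so the logic can be tried against a stub.

```python
def build_sentiment_job_request(input_s3_uri, output_s3_uri, role_arn,
                                job_name, language_code="en"):
    """Assemble keyword arguments for start_sentiment_detection_job."""
    return {
        "JobName": job_name,
        "LanguageCode": language_code,
        # IAM role that lets Comprehend read the input and write the output
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # ONE_DOC_PER_LINE treats each line as a document;
            # ONE_DOC_PER_FILE treats each S3 object as a document.
            "InputFormat": "ONE_DOC_PER_LINE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


def submit_sentiment_job(client, request):
    """Start the job and return its ID; results land in S3 once it completes."""
    response = client.start_sentiment_detection_job(**request)
    return response["JobId"]


def sentiment_job_status(client, job_id):
    """Return the job status string, e.g. SUBMITTED, IN_PROGRESS, COMPLETED, FAILED."""
    response = client.describe_sentiment_detection_job(JobId=job_id)
    return response["SentimentDetectionJobProperties"]["JobStatus"]
```

Separating request construction from submission is only a design convenience here: it keeps the S3 and IAM details in one reviewable place and makes the job parameters easy to log before anything is sent.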
Gotchas & Constraints
- Gotcha #1: Comprehend charges per unit (one unit is 100 characters), so costs can add up for large text volumes.
- Gotcha #2: Accuracy varies by language; English has the highest accuracy, and other languages may be lower.
- Constraints: maximum 5,000 bytes per document (synchronous), maximum 100 MB per document (asynchronous), and custom models require a minimum of 1,000 training documents.
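The unit-based billing and the synchronous size limit both come down to simple arithmetic, sketched below. The 100-character unit and the 5,000-byte limit are taken from the text above. The three-unit minimum per request is an assumption that should be checked against the current pricing page, and the function names are hypothetical. Note that billing counts characters while the size limit counts UTF-8 bytes, so the two can differ for non-ASCII text.

```python
import math

UNIT_CHARS = 100        # one billing unit, per the text above
SYNC_MAX_BYTES = 5000   # synchronous per-document limit quoted above


def estimate_units(texts, min_units_per_request=3):
    """Estimate billable units, assuming one API request per text.

    min_units_per_request=3 is an assumption; verify it against current pricing.
    """
    return sum(
        max(min_units_per_request, math.ceil(len(t) / UNIT_CHARS))
        for t in texts
    )


def split_by_bytes(text, max_bytes=SYNC_MAX_BYTES):
    """Split text into chunks whose UTF-8 encoding fits within max_bytes.

    Splits only between characters, so multi-byte characters stay intact.
    It does not respect word or sentence boundaries.
    """
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```

For example, 100,000 reviews of 500 characters each come to 5 units per review, or 500,000 units in total, for each analysis type requested. A production pipeline would likely split on sentence boundaries rather than raw characters, since cutting mid-sentence can distort sentiment and entity results.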
An e-commerce company receives 100,000 customer reviews each month. Reading them manually is impractical, but the company still needs to understand customer sentiment and identify issues. With Comprehend, it calls the sentiment detection API on each review to classify it as positive, negative, or neutral, then aggregates sentiment by product to identify products with declining sentiment. It uses entity detection to identify mentioned features ('battery life', 'screen quality', 'customer service') and key phrase detection to find common complaints ('shipping delay', 'defective product'). For support tickets, it uses custom entity recognition to extract order numbers, product SKUs, and issue types, automatically routing tickets to the right team. All reviews are processed in real time, feeding dashboards that show sentiment trends and top issues.
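The per-product aggregation step in this scenario could look roughly like the sketch below. It uses boto3's `batch_detect_sentiment`, which accepts up to 25 documents per call, in place of one call per review. The function names and the `(product_id, review_text)` input shape are assumptions for illustration, not something the scenario specifies.

```python
from collections import Counter, defaultdict

BATCH_SIZE = 25  # BatchDetectSentiment accepts up to 25 documents per call


def sentiment_by_product(client, reviews, language_code="en"):
    """Count sentiment labels per product.

    `reviews` is a list of (product_id, review_text) tuples (assumed shape).
    Returns {product_id: Counter({"POSITIVE": n, "NEGATIVE": m, ...})}.
    """
    counts = defaultdict(Counter)
    for start in range(0, len(reviews), BATCH_SIZE):
        batch = reviews[start:start + BATCH_SIZE]
        response = client.batch_detect_sentiment(
            TextList=[text for _, text in batch],
            LanguageCode=language_code,
        )
        for result in response["ResultList"]:
            # Index is the document's position within this batch's TextList
            product_id = batch[result["Index"]][0]
            counts[product_id][result["Sentiment"]] += 1
        # Documents reported in response["ErrorList"] are skipped in this sketch
    return counts


def negative_share(counter):
    """Fraction of a product's reviews labeled NEGATIVE (0.0 if it has none)."""
    total = sum(counter.values())
    return counter["NEGATIVE"] / total if total else 0.0
```

Computing `negative_share` per product for each month and comparing it with the previous month is one simple way to flag declining sentiment. Any alerting threshold is a business choice rather than something Comprehend provides.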
The Result
Proactive issue identification, data-driven product improvements, and 80% faster issue resolution.