Machine Learning
Unit 5: Trends and Applications / Speech Recognition
3. Speech Recognition
Also known as Automatic Speech Recognition (ASR) or Speech-to-Text (STT), this is the process of converting spoken language audio into machine-readable text.
Types of ASR
- Speaker-Dependent vs Speaker-Independent: Trained or adapted for one specific voice (e.g., Dragon Dictate) vs designed to work for any voice (e.g., Siri).
- Continuous vs Isolated: Flowing natural speech vs single words with pauses.
- Keyword Spotting: Detecting specific trigger words ("Hey Siri").
Deep Dive: The Classic HMM-Based Pipeline
- Audio Capture: A microphone captures the analog sound wave, which is sampled and converted to a digital signal (typically at 16 kHz for speech).
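- Feature Extraction (MFCC): The signal is split into short overlapping frames, and Mel-Frequency Cepstral Coefficients summarize each frame's spectrum on the mel scale, which approximates how humans perceive pitch. This gives a compact feature vector per frame; MFCCs do not remove background noise by themselves.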
- Acoustic Model (HMM): Hidden Markov Models (traditionally with Gaussian mixture emissions) score how well sequences of audio feature frames match specific phonemes (basic sound units).
- Pronunciation Dictionary: Maps sequences of phonemes to actual words.
- Language Model: Assigns probabilities to word sequences so that plausible sentences are preferred over acoustically similar but unlikely ones (e.g., choosing "I ate" over "I eight").
- Decoder: Searches over candidate word sequences, combining acoustic, pronunciation, and language model scores, and outputs the highest-probability text transcript.
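The acoustic model and decoder steps above can be illustrated with Viterbi decoding, the standard dynamic-programming search for the most likely hidden state sequence in an HMM. This is a minimal sketch, not a production decoder: the three phoneme states ("k", "ae", "t"), the quantized frame symbols ("A", "B", "C"), and all probabilities are invented for illustration. Real systems score continuous MFCC vectors and search over whole words and sentences.

```python
import math

def log(p):
    """Log that tolerates zero probabilities."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_log_prob, best_state_path) for a discrete HMM."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path

# Hypothetical toy model: phonemes of "cat" emitting quantized frame symbols
STATES = ["k", "ae", "t"]
START_P = {"k": 0.8, "ae": 0.1, "t": 0.1}
TRANS_P = {
    "k": {"k": 0.5, "ae": 0.4, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.5, "t": 0.4},
    "t": {"k": 0.1, "ae": 0.1, "t": 0.8},
}
EMIT_P = {
    "k": {"A": 0.7, "B": 0.2, "C": 0.1},
    "ae": {"A": 0.1, "B": 0.7, "C": 0.2},
    "t": {"A": 0.1, "B": 0.2, "C": 0.7},
}

frames = ["A", "A", "B", "B", "C"]
log_prob, phonemes = viterbi(frames, STATES, START_P, TRANS_P, EMIT_P)
print(phonemes)  # ['k', 'k', 'ae', 'ae', 't']
```

Each phoneme can span several frames, which is why the path repeats "k" and "ae"; a pronunciation dictionary would then map the collapsed sequence k-ae-t to the word "cat".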
Modern Deep Learning Approach
Modern ASR replaces the multi-step HMM pipeline with end-to-end sequence-to-sequence models. Architectures built on LSTMs and Transformers (e.g., OpenAI's Whisper) map audio features, typically log-mel spectrogram frames rather than raw waveforms, directly to text tokens. Alignment between audio frames and output tokens is handled either by attention in an encoder-decoder model (as in Whisper) or by CTC (Connectionist Temporal Classification).
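To make the CTC alignment idea concrete: a CTC model emits one label per audio frame, including a special blank label, and the output text is recovered by merging repeated labels and then removing blanks. The sketch below shows only this greedy collapsing step on a hand-written frame sequence. The "_" blank symbol and the function name are illustrative assumptions, and a real model would first take the highest-scoring label for each frame from its network output.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC blank label

def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse per-frame labels into text: merge repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive labels
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)

# 13 frames of per-frame predictions for a 5-letter word
frames = list("hh_e_ll_ll_oo")
print(ctc_greedy_decode(frames))  # hello
```

The blank between the two "ll" runs is what lets the model output a genuine double letter; without it, the repeated "l" frames would merge into a single "l".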
Challenges
- Accents & Dialects: High regional pronunciation variability.
- Background Noise: Includes the cocktail party problem (isolating the primary speaker from competing voices and ambient noise).
- Homophones: Words that sound identical but mean different things (their, there, they're).
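Because homophones are acoustically identical, they can only be resolved from context, which is the job of the language model. The sketch below picks among "their", "there", and "they're" using a tiny bigram model with add-one smoothing. The counts are invented for illustration and the function names are hypothetical; a real system would estimate probabilities from a large text corpus or use a neural language model.

```python
import math

# Hypothetical bigram counts (previous word, candidate), invented for illustration
BIGRAM_COUNTS = {
    ("over", "there"): 40, ("over", "their"): 1, ("over", "they're"): 1,
    ("lost", "their"): 30, ("lost", "there"): 2, ("lost", "they're"): 1,
    ("think", "they're"): 25, ("think", "there"): 5, ("think", "their"): 3,
}
HOMOPHONES = ["their", "there", "they're"]

def bigram_log_prob(prev_word, word, vocab):
    """Add-one smoothed log P(word | prev_word), normalized over vocab."""
    total = sum(BIGRAM_COUNTS.get((prev_word, w), 0) for w in vocab)
    count = BIGRAM_COUNTS.get((prev_word, word), 0)
    return math.log((count + 1) / (total + len(vocab)))

def pick_homophone(prev_word, candidates=HOMOPHONES):
    """Choose the candidate the bigram model finds most likely after prev_word."""
    return max(candidates, key=lambda w: bigram_log_prob(prev_word, w, candidates))

for prev in ["over", "lost", "think"]:
    print(prev, pick_homophone(prev))
```

With these toy counts the model selects "over there", "lost their", and "think they're", even though the acoustic model alone could not tell the three words apart.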