Machine Learning

Unit 5: Trends and Applications / Speech Recognition

3. Speech Recognition

Speech recognition, also known as Automatic Speech Recognition (ASR) or Speech-to-Text (STT), is the process of converting spoken-language audio into machine-readable text.

Types of ASR

Speaker-Dependent vs Speaker-Independent: Trained on a specific user's voice (e.g., Dragon Dictate) vs built to work for any voice (e.g., Siri).
Continuous vs Isolated: Flowing natural speech vs single words with pauses.
Keyword Spotting: Detecting specific trigger words ("Hey Siri").

Deep Dive: The Classic HMM-Based Pipeline

Audio Capture: A microphone captures the analog sound wave, which is sampled and converted to a digital signal (typically at 16 kHz for speech).
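Feature Extraction (MFCC): Mel-Frequency Cepstral Coefficients summarize the spectral shape of each short audio frame on the perceptually motivated mel scale, giving a compact representation of the frequencies most relevant to human hearing. They do not remove background noise by themselves.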
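Acoustic Model (HMM): Hidden Markov Models, classically paired with Gaussian mixture models for the emission probabilities, score how well the audio features match each phoneme (speech sound) and model how phonemes unfold over time.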
Pronunciation Dictionary: Maps sequences of phonemes to actual words.
Language Model: Assigns probabilities to word sequences so that plausible sentences are preferred over acoustically similar but unlikely ones (e.g., "I ate" over "I eight").
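Decoder: Searches over the combined acoustic, pronunciation, and language model scores (typically with Viterbi or beam search) and outputs the highest-probability text transcript.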
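
The decoding step above can be illustrated with a minimal Viterbi sketch over a toy HMM. Everything concrete here is an assumption made for illustration: the three-phoneme inventory, the transition probabilities, and the per-frame emission likelihoods are invented, and a real system would obtain them from trained acoustic models and search over words rather than bare phonemes.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path for one observation sequence.

    log_init:  (S,)   log P(state at frame 0)
    log_trans: (S, S) log P(next state j | current state i)
    log_emit:  (T, S) log P(observation at frame t | state)
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans        # cand[i, j]: reach j from i
        backptr[t] = np.argmax(cand, axis=0)     # best predecessor of each j
        score = cand[backptr[t], np.arange(S)] + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Hypothetical phoneme inventory and made-up probabilities.
PHONEMES = ["ay", "ey", "t"]
log_init = np.log([0.5, 0.4, 0.1])
log_trans = np.log([[0.6, 0.3, 0.1],
                    [0.1, 0.6, 0.3],
                    [0.1, 0.1, 0.8]])
log_emit = np.log([[0.7, 0.2, 0.1],   # one row per audio frame
                   [0.6, 0.3, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.1, 0.1, 0.8]])

states = viterbi(log_init, log_trans, log_emit)
print([PHONEMES[s] for s in states])
```

Working in log space avoids numerical underflow when many small probabilities are multiplied across frames, which is why the scores are added rather than multiplied.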

Modern Deep Learning Approach

Modern ASR largely replaces the multi-step HMM pipeline with End-to-End Sequence-to-Sequence models. Neural architectures such as LSTMs and Transformers map audio features (typically log-mel spectrograms) directly to text tokens. The alignment between audio frames and output tokens is handled either by attention mechanisms, as in encoder-decoder models like OpenAI's Whisper, or by CTC (Connectionist Temporal Classification).
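
To make the CTC alignment idea concrete, here is a minimal sketch of greedy CTC decoding: take the best symbol in each frame, merge consecutive repeats, then drop the blank symbol. The vocabulary, the choice of index 0 for the blank, and the frame probabilities are assumptions made up for this example; production systems usually use beam search, often combined with a language model.

```python
import numpy as np

BLANK = 0  # index reserved for the CTC blank symbol (assumed for this sketch)

def ctc_greedy_decode(log_probs, vocab):
    """Greedy CTC decoding.

    log_probs: (T, V) per-frame scores over the vocabulary
    vocab:     list where vocab[i] is the character for index i
    """
    best = np.argmax(log_probs, axis=1)   # best symbol for each frame
    out = []
    prev = BLANK
    for idx in best:
        # Emit only when the symbol changes and is not the blank.
        if idx != prev and idx != BLANK:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)

# Hypothetical vocabulary and a frame-level labelling "hh_ell_loo".
vocab = ["_", "h", "e", "l", "o"]
frames = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
probs = np.full((len(frames), len(vocab)), 0.05)
probs[np.arange(len(frames)), frames] = 0.8

decoded = ctc_greedy_decode(np.log(probs), vocab)
print(decoded)  # hello
```

The blank between the two runs of "l" is what lets CTC output a genuine double letter; without it, the repeated frames would collapse into a single "l".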

Challenges

Accents & Dialects: High regional pronunciation variability.
Background Noise: The cocktail party problem (separating the primary speaker from competing voices and ambient noise).
Homophones: Words that sound identical but mean different things (their, there, they're).
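
Homophones cannot be separated by sound alone, so systems rely on word context. The sketch below shows the idea with a tiny bigram language model using add-one smoothing. The four-sentence corpus and the candidate transcripts are invented for illustration; a real language model is trained on far more text and is usually a neural network or a higher-order n-gram model.

```python
from collections import Counter
import math

# Hypothetical toy corpus standing in for real training text.
CORPUS = [
    "they left their keys over there",
    "their car is over there",
    "they're sure their dog is there",
    "i ate lunch over there",
]

def train_bigrams(sentences):
    """Count bigrams and how often each word appears as a left context."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        words = ["<s>"] + s.split()
        vocab.update(words)
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, len(vocab)

def log_prob(sentence, contexts, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a sentence."""
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1)
                          / (contexts[prev] + vocab_size))
    return total

contexts, bigrams, vocab_size = train_bigrams(CORPUS)

# Acoustically identical candidates that differ only in the homophone chosen.
candidates = [
    "their car is over there",
    "there car is over their",
    "they're car is over there",
]
best = max(candidates,
           key=lambda c: log_prob(c, contexts, bigrams, vocab_size))
print(best)  # their car is over there
```

In a full recognizer this score would be combined with the acoustic score rather than used on its own.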
