Machine Learning
3. Natural Language Processing (NLP)
Natural language processing (NLP) bridges the gap between human communication and computer understanding. Because machine learning models operate on numbers, raw text must first be preprocessed (cleaned and normalized) and then vectorized (converted into numeric vectors).
1. Text Preprocessing Pipeline
Tokenization
Splitting sentences into individual words or sub-words (tokens).
Stop Words Removal
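Removing extremely common words (e.g., "the", "is", "at") that carry little semantic meaning on their own. This can hurt tasks where such words matter (e.g., "not" in sentiment analysis), so the stop list should fit the task.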
Stemming
Applying heuristic rules to strip affixes, mostly suffixes, and reduce a word to a crude root (e.g., "running" → "run"). Very fast, but the result is often not a real word (e.g., "studies" → "studi").
Lemmatization
Using a dictionary and morphological analysis to reduce a word to its base linguistic form, the lemma (e.g., "better" → "good"). Slower than stemming and usually needs the word's part of speech, but it returns valid words.
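The four steps above can be sketched in plain Python. This is a toy illustration, not a production pipeline: the stop list, the suffix list, and the lemma dictionary (`STOP_WORDS`, `SUFFIXES`, `LEMMA_DICT`) are tiny hand-made assumptions, and `naive_stem` is a deliberately crude stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny hand-made resources for illustration only (assumptions, not real NLP data).
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "of", "on", "in", "were"}
SUFFIXES = ("ing", "ies", "ed", "es", "s")  # checked in this order
LEMMA_DICT = {"better": "good", "running": "run", "studies": "study", "mice": "mouse"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def lemmatize(token):
    """Look the token up in the dictionary; unknown words pass through unchanged."""
    return LEMMA_DICT.get(token, token)

tokens = remove_stop_words(tokenize("The mice were running at the better studies"))
stems = [naive_stem(t) for t in tokens]
lemmas = [lemmatize(t) for t in tokens]

print(tokens)  # ['mice', 'running', 'better', 'studies']
print(stems)   # ['mice', 'runn', 'better', 'stud']
print(lemmas)  # ['mouse', 'run', 'good', 'study']
```

The stemmer is cheap but blind: it produces "runn" (real stemmers add rules for doubled consonants) and leaves "mice" and "better" untouched. The lemmatizer handles those irregular forms, but only because someone put them in the dictionary.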
2. Word Embeddings
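Converting tokens into numerical vectors. Classic count-based methods produce sparse vectors, while word embeddings produce dense vectors in which semantically similar words lie close together.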
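- Bag of Words (BoW) & TF-IDF: Sparse vectors with one dimension per vocabulary word. BoW stores raw word counts; TF-IDF reweights those counts so that words appearing in many documents count for less. Neither captures context or word order.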
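- Word2Vec / GloVe: Dense vectors (e.g., 300 dimensions) learned from word co-occurrence in large text corpora. Semantic relationships show up as vector arithmetic: "King" - "Man" + "Woman" lands close to "Queen". Each word gets a single fixed vector, regardless of the sentence it appears in.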
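To make the two vector styles concrete, here is a minimal sketch using only the standard library. The three-sentence corpus is made up, the IDF formula is the plain log(N / df) variant (libraries typically use smoothed versions), and the `toy` vectors are hand-made 2-D stand-ins with axes chosen by hand, not learned Word2Vec or GloVe embeddings.

```python
import math
from collections import Counter

# Made-up corpus for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bow_vector(tokens):
    """Raw count of each vocabulary word; word order is discarded."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def idf(word):
    """Plain log(N / df): words found in every document get weight 0."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(tokens):
    """Term frequency (count / document length) scaled by IDF."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf(w) for w in vocab]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# BoW cannot tell these two sentences apart, although they mean different things.
same_bag = bow_vector("the cat chased the dog".split()) == bow_vector("the dog chased the cat".split())

# "the" occurs in every document, so TF-IDF assigns it zero weight.
the_weight = tfidf_vector(tokenized[0])[vocab.index("the")]

# Hand-made dense vectors (axes: royalty, gender). NOT learned embeddings.
toy = {"king": [1.0, 1.0], "queen": [1.0, -1.0], "man": [0.0, 1.0], "woman": [0.0, -1.0]}
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
best = max(toy, key=lambda word: cosine(toy[word], target))

print(bow_vector(tokenized[0]))  # [1, 0, 0, 0, 1, 1, 1, 2]
print(same_bag, the_weight)      # True 0.0
print(best)                      # queen
```

With real embeddings the analogy only holds approximately, and the input words are usually excluded from the nearest-neighbour search.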
Deep Dive: Attention and Transformers
Prior to 2017, NLP relied mainly on RNNs and LSTMs, which process text one token at a time. This makes them slow to train and prone to losing information from early words in long paragraphs.
- The Attention Mechanism: Allows a model to dynamically look at ALL words in a sentence simultaneously and weigh their importance when understanding a specific word. Example: In "The animal didn't cross the street because it was too tired", Attention helps the model know "it" refers to "animal", not "street".
- Transformers: An architecture (introduced in the 2017 paper "Attention Is All You Need") that replaces recurrence with Attention. Because tokens are no longer processed one after another, training is highly parallelizable.
- LLMs (Large Language Models): Models like GPT (Generative Pre-trained Transformer) scale this architecture up to billions, and in the largest cases hundreds of billions, of parameters.
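The core computation behind Attention, scaled dot-product attention, can be sketched in a few lines of plain Python as softmax(Q K^T / sqrt(d_k)) V. The sketch below makes simplifying assumptions: the three 2-D token vectors in `x` are hypothetical values picked by hand, and the same vectors serve as queries, keys, and values, whereas a real Transformer derives each from learned projection matrices and uses multiple attention heads.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-D vectors: "it" is placed close to "animal" by construction.
tokens = ["animal", "street", "it"]
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = attention(x, x, x)
for token, w in zip(tokens, weights[2]):
    print(f"'it' attends to {token!r} with weight {w:.2f}")
```

Every query row is computed independently of the others, which is why the whole operation can run in parallel as matrix multiplications. Here "it" puts more weight on "animal" than on "street" only because the toy vectors were chosen that way; a trained model learns such associations from data.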