Machine Learning
5. Email Spam & Malware Filtering
Machine learning lets filters adapt dynamically to new spam tactics and zero-day malware threats, rather than relying only on fragile, static signature-based rules.
The Naive Bayes Classifier
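Naive Bayes is one of the most widely used statistical algorithms for spam detection. It models the probability that an email is spam given the words it contains (Bag of Words model), under the "naive" assumption that words occur independently of one another given the class.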
- Strengths: Extremely fast, handles high-dimensional text data well, resilient to noise.
- How it works: If the calculated probability P(Spam | Words) exceeds a threshold, the email is flagged.
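The thresholding rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production filter: the four training messages, the whitespace tokenizer, the add-one (Laplace) smoothing, and the 0.5 threshold are all assumptions chosen for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple Bag of Words tokenizer)."""
    return text.lower().split()

def train(messages):
    """messages: list of (text, label) pairs, with label "spam" or "ham"."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def spam_probability(text, model):
    """Return P(Spam | Words) under the naive independence assumption."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        # Log prior: fraction of training emails carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one smoothing keeps unseen word/class pairs from zeroing the product.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Turn the two log scores into a normalized probability.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])

# Toy training data, invented for illustration only.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize click here", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
THRESHOLD = 0.5  # assumed cut-off; real filters tune this value

model = train(TRAINING)

def is_spam(text):
    """Flag the email when P(Spam | Words) exceeds the threshold."""
    return spam_probability(text, model) > THRESHOLD
```

Calling is_spam("free money") flags the message, while is_spam("monday meeting") does not. Working in log space avoids numeric underflow when many small word probabilities are multiplied together.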
Deep Dive: Spam Detection Features
- Content-Based: Word frequencies (TF-IDF) for terms like "free", "urgent", "click here". Character features like ALL CAPS or excessive punctuation (!!!).
- Header-Based: Sender IP reputation, mismatch between 'From' and 'Reply-To' addresses, SPF/DKIM authentication failures.
- Behavioral: Sending rate (sudden bursts can indicate botnets), historical user spam reports for that sender.
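As a rough sketch of how the content-based and header-based features above might be turned into numbers, the hypothetical extract_features function below builds a small feature dictionary. The function name, its parameters, and the feature names are assumptions made for this example. It uses raw term counts rather than TF-IDF, and it leaves out behavioral features, which need sending history that a single message does not carry.

```python
import re

# Terms taken from the examples in the feature list above.
SPAM_TERMS = ("free", "urgent", "click here")

def extract_features(subject, body, from_addr, reply_to, spf_pass, dkim_pass):
    """Build a feature dict for one email (hypothetical helper, for illustration)."""
    text = f"{subject} {body}"
    lowered = text.lower()
    words = re.findall(r"[A-Za-z]+", text)
    # Words of two or more letters written entirely in capitals.
    caps_words = [w for w in words if len(w) > 1 and w.isupper()]
    # A missing Reply-To header (None) is not treated as a mismatch.
    mismatch = reply_to is not None and from_addr.lower() != reply_to.lower()
    return {
        # Content-based features
        "spam_term_count": sum(lowered.count(term) for term in SPAM_TERMS),
        "all_caps_ratio": len(caps_words) / len(words) if words else 0.0,
        "max_exclamation_run": max((len(run) for run in re.findall(r"!+", text)), default=0),
        # Header-based features
        "from_reply_to_mismatch": int(mismatch),
        "spf_fail": int(not spf_pass),
        "dkim_fail": int(not dkim_pass),
    }

# Example message, invented for illustration.
features = extract_features(
    subject="URGENT offer!!!",
    body="Click here for FREE stuff",
    from_addr="support@bank.example",
    reply_to="collect@other.example",
    spf_pass=True,
    dkim_pass=False,
)
```

A dictionary like this can be fed to any classifier after conversion to a fixed-order numeric vector.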
Malware Detection using ML
Traditional antivirus matches files against signatures of known malware. ML models learn general patterns instead, which lets them detect previously unseen, zero-day malware.
1. Static Analysis
Analyzes the file without executing it. Features include file size, PE header fields, imported API calls, and raw byte n-grams.
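A minimal sketch of two of the static features mentioned above, file size and raw byte n-grams, is shown below. The function names and the bucket count are assumptions for this example. PE header fields and imported API calls are left out because parsing them normally needs a dedicated PE parser. The n-gram counts are folded into a fixed number of buckets so that every file yields a vector of the same length.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping run of n consecutive bytes in the file."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def static_feature_vector(data: bytes, n: int = 2, buckets: int = 16) -> list:
    """Return [file_size, bucket_0, ..., bucket_{buckets-1}] without executing the file."""
    vector = [0.0] * buckets
    grams = byte_ngrams(data, n)
    total = sum(grams.values())
    for gram, count in grams.items():
        # Fold each n-gram into a bucket so the vector length stays fixed.
        index = int.from_bytes(gram, "big") % buckets
        vector[index] += count / total
    return [float(len(data))] + vector

# Eight made-up bytes standing in for a file's contents.
sample = b"MZ\x90\x00MZ\x90\x00"
sample_vector = static_feature_vector(sample)
```

In practice the bytes would come from reading the file in binary mode, and the bucket count would be far larger than 16 to limit collisions between unrelated n-grams.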
2. Dynamic Analysis
Executes the file in a secure sandbox and monitors behavioral features: registry edits, network calls, and file system modifications. The resulting sequence of events is often modeled with sequence models such as RNNs/LSTMs.
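Before a sequence model such as an RNN or LSTM can consume a sandbox trace, the events have to be converted into fixed-length integer sequences. The sketch below shows only that preparation step. The event names, the traces, and the helper functions are hypothetical and not the output format of any particular sandbox.

```python
PAD, UNK = 0, 1  # reserved IDs for padding and for events not seen in training

def build_vocab(traces):
    """Assign an integer ID to each distinct event, in order of first appearance."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for trace in traces:
        for event in trace:
            vocab.setdefault(event, len(vocab))
    return vocab

def encode(trace, vocab, max_len):
    """Map events to IDs, truncate to max_len, then right-pad with PAD."""
    ids = [vocab.get(event, UNK) for event in trace][:max_len]
    return ids + [PAD] * (max_len - len(ids))

# Invented traces: one list of observed events per sandbox run.
TRACES = [
    ["file_write", "registry_write", "network_connect"],
    ["file_write", "file_delete"],
]
VOCAB = build_vocab(TRACES)
ENCODED = [encode(trace, VOCAB, max_len=4) for trace in TRACES]
```

Each encoded row can then be passed to an embedding layer followed by a recurrent layer. The order of events is preserved, which is what lets a sequence model pick up on behavior such as writing a file and then opening a network connection.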