Machine Learning

5. Email Spam & Malware Filtering

Machine learning allows systems to dynamically adapt to new spam tactics and zero-day malware threats, replacing fragile, static signature-based rules.

The Naive Bayes Classifier

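Naive Bayes is one of the most widely used statistical algorithms for spam detection. It estimates the probability that an email is spam from the words it contains, treating the email as an unordered Bag of Words and assuming that each word occurs independently of the others given the class (the "naive" assumption).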

P(Spam | Email) ∝ P(Email | Spam) × P(Spam)
  • Strengths: Extremely fast, handles high-dimensional text data well, resilient to noise.
  • How it works: If the computed posterior P(Spam | Email) exceeds a chosen threshold, the email is flagged as spam.
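
The scoring rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production filter: the whitespace tokenizer, the four-message training set, the add-one (Laplace) smoothing, and the 0.5 threshold are all assumptions made for the example. Log-probabilities are used so that multiplying many small numbers does not underflow.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude bag of words)."""
    return text.lower().split()

def train(emails):
    """emails: list of (text, is_spam) pairs. Returns the counts needed for scoring."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = {True: 0, False: 0}
    for text, is_spam in emails:
        doc_counts[is_spam] += 1
        word_counts[is_spam].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def spam_probability(text, model):
    """Return P(Spam | Email) under the naive independence assumption."""
    word_counts, doc_counts, vocab = model
    total_docs = doc_counts[True] + doc_counts[False]
    log_score = {}
    for label in (True, False):
        # Start from the prior: log P(class)
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen in training
            # Add log P(word | class) with add-one (Laplace) smoothing
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        log_score[label] = score
    # Normalize the two log scores into a probability
    top = max(log_score.values())
    exp_spam = math.exp(log_score[True] - top)
    exp_ham = math.exp(log_score[False] - top)
    return exp_spam / (exp_spam + exp_ham)

# Toy training data, invented for this example
TRAINING = [
    ("free prize click here", True),
    ("urgent free offer", True),
    ("meeting notes attached", False),
    ("lunch at noon", False),
]
THRESHOLD = 0.5  # hypothetical cutoff; real filters tune this value

model = train(TRAINING)

def is_spam(text):
    """Flag the email when the posterior exceeds the threshold."""
    return spam_probability(text, model) > THRESHOLD
```

With this toy data, `is_spam("free click")` returns True and `is_spam("meeting at noon")` returns False. Raising the threshold trades missed spam for fewer legitimate emails flagged by mistake.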

Deep Dive: Spam Detection Features

  • Content-Based: Word frequencies (TF-IDF) for terms like "free", "urgent", "click here". Character features like ALL CAPS or excessive punctuation (!!!).
  • Header-Based: Sender IP reputation, mismatch between 'From' and 'Reply-To' addresses, SPF/DKIM authentication failures.
  • Behavioral: Send rate velocity (identifying botnets), historical user spam reports for that sender.
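
As a rough sketch of how some of the content-based and header-based features above might be computed, the function below turns one message into a feature dictionary. The function name, the feature names, and the term list are hypothetical. It uses raw substring counts rather than TF-IDF weights to stay short, and it does not cover IP reputation, SPF/DKIM results, or behavioral signals, which need data from outside the message itself.

```python
import re
from email.utils import parseaddr

SPAM_TERMS = ("free", "urgent", "click here")  # example terms from the list above

def extract_features(headers, body):
    """headers: dict of header name -> value; body: plain-text message body."""
    lower = body.lower()
    letters = [c for c in body if c.isalpha()]
    features = {}
    for term in SPAM_TERMS:
        # Raw substring count (so "freedom" also counts as "free"); not TF-IDF
        features["count_" + term.replace(" ", "_")] = lower.count(term)
    # Share of alphabetic characters that are upper case (ALL CAPS signal)
    features["caps_ratio"] = (
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    )
    # Number of runs of three or more exclamation marks
    features["exclamation_runs"] = len(re.findall(r"!{3,}", body))
    # Compare the bare addresses in From and Reply-To
    from_addr = parseaddr(headers.get("From", ""))[1].lower()
    reply_to = parseaddr(headers.get("Reply-To", ""))[1].lower()
    features["reply_to_mismatch"] = bool(reply_to) and reply_to != from_addr
    return features

# Made-up message for illustration
example = extract_features(
    {"From": "Bank <support@bank.example>", "Reply-To": "attacker@evil.example"},
    "URGENT!!! Click here for a FREE gift",
)
```

A dictionary like `example` can then be vectorized and passed to a classifier alongside the word-level features.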

Malware Detection using ML

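Traditional antivirus matches files against signatures of known malware, so it misses anything it has not seen before. ML models learn from features of the file or its behavior, which lets them flag previously unseen, zero-day malware.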

1. Static Analysis

Analyzes the file without executing it. Features include file size, PE header fields, imported API calls, and raw byte n-grams.
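
A minimal sketch of static feature extraction follows, covering only file size, a single header check, and byte n-grams. The function names and sample bytes are invented for the example. Parsing full PE header fields and import tables normally relies on a dedicated parser, which is left out here.

```python
from collections import Counter

def byte_ngrams(data, n=2):
    """Count overlapping byte n-grams in a bytes object."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def static_features(data, top_k=3):
    """Build a small feature dict from raw bytes, without executing anything."""
    return {
        "file_size": len(data),
        # PE files begin with the DOS header magic bytes "MZ"
        "has_mz_magic": data[:2] == b"MZ",
        # Most frequent bigrams as (bytes, count) pairs
        "top_bigrams": byte_ngrams(data, 2).most_common(top_k),
    }

# Made-up bytes standing in for the start of a file
sample = b"MZ\x90\x00\x90\x00\x90\x00"
features = static_features(sample)
```

In practice the n-gram counts are usually restricted to a fixed vocabulary or hashed into a fixed-length vector so that every file yields the same number of features.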

2. Dynamic Analysis

Executes the file in a secure sandbox and monitors its behavior: registry edits, network calls, and file system modifications. The resulting sequence of events is often modeled with sequence models such as RNNs/LSTMs.
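
Before a sequence model can consume sandbox output, the event trace has to be turned into fixed-length integer sequences. The sketch below shows one common way to do that. The event names, the reserved ids, and the helper functions are hypothetical, and the model itself is omitted.

```python
PAD, UNK = 0, 1  # reserved ids: padding and unknown event

def build_vocab(traces):
    """Assign each distinct event name an integer id, starting at 2."""
    vocab = {}
    for trace in traces:
        for event in trace:
            vocab.setdefault(event, len(vocab) + 2)
    return vocab

def encode(trace, vocab, max_len):
    """Map events to ids, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(event, UNK) for event in trace[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

# Invented sandbox traces, one list of events per executed sample
traces = [
    ["registry_write", "network_connect", "file_write"],
    ["file_write", "file_write"],
]
vocab = build_vocab(traces)
encoded = [encode(trace, vocab, max_len=4) for trace in traces]
```

Here `encoded` is `[[2, 3, 4, 0], [4, 4, 0, 0]]`. An event that never appeared when the vocabulary was built maps to the UNK id rather than raising an error.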