Machine Learning

Unsupervised Learning / Anomaly Detection


6. Anomaly Detection Algorithm

Anomaly Detection (or Outlier Detection) is the process of identifying data points or events that deviate significantly from the expected pattern or normal behavior.

Types of Anomalies

  • Point Anomalies: A single instance is anomalous relative to the rest of the data (e.g., a huge credit card purchase).
  • Contextual Anomalies: Normal under some contexts but anomalous in others (e.g., buying a snow shovel in July).
  • Collective Anomalies: A collection of related data instances is anomalous, even if individual points are not (e.g., rapid sequential login attempts).

Detection Approaches

1. Statistical Methods (Z-Score)

Assumes normal data follows a Gaussian distribution.

Z = |(x - μ) / σ|

If Z > 3, the point is flagged as an anomaly.

2. Density-Based (DBSCAN)

Groups dense regions into clusters. Points are classified as Core, Border, or Noise.

Noise points (not within ε-radius of any core point) are anomalies.

3. Distance-Based (LOF - Local Outlier Factor)

Compares the local density of a point to the local densities of its neighbors. A point with a much lower density than its neighbors is an anomaly.

Machine Learning Approach: Isolation Forest

Isolation Forest is specifically designed for anomaly detection. It isolates anomalies rather than profiling normal points.

  1. Build multiple random decision trees (isolation trees).
  2. At each node, randomly select a feature and a random split value between the min and max.
  3. Repeat until each point is isolated in its own leaf node.
The Core Logic

Because anomalies are rare and different, they require fewer splits to be isolated. Therefore, points with a shorter path length from the root to the leaf are scored as anomalies.