Machine Learning

Supervised Learning: Classification / K-Nearest Neighbour (KNN)


3. K-Nearest Neighbour (KNN) Algorithm

KNN is a non-parametric, instance-based, lazy learning algorithm. It does not build an explicit model during training; instead, it memorizes the training dataset and makes predictions at test time by looking at the K closest data points.

Working of KNN

  1. Load the training data and choose the value of K.
  2. For a new test point, calculate the distance to all training points using a distance metric (e.g., Euclidean).
  3. Select the K nearest neighbours (smallest distances).
  4. For classification: assign the class that appears most frequently among the K neighbours (majority voting).
  5. For regression: compute the mean/average of K neighbours' target values.

Distance Metrics

To find the "nearest" neighbours, the algorithm relies on distance calculations. The two most common metrics are:

1. Euclidean Distance (L2 Norm)

The straight-line distance between two points. Best for continuous real-valued features.

d(x, y) = √[ Σ(xᵢ - yᵢ)² ]
2. Manhattan Distance (L1 Norm)

The distance between two points measured along axes at right angles (like a taxi navigating city blocks). Best for high-dimensional or categorical data.

d(x, y) = Σ |xᵢ - yᵢ|

Choosing the Value of K

The choice of K is critical and has a significant impact on model performance:

  • Small K (e.g., K=1): Model is very sensitive to noise (overfitting). High variance.
  • Large K: Model is too smooth and may miss local patterns (underfitting). High bias.

Best Practices:

  • Rule of thumb: K = √N where N is the number of training samples.
  • Use cross-validation to select the optimal K.
  • Choose an odd K for binary classification to avoid ties in majority voting.

Validation Error in KNN

Validation error measures how well the model generalizes to unseen data. As K increases:

  • Training error increases (model becomes less flexible).
  • Validation error first decreases (underfitting reduced), then increases (bias increases).
  • Optimal K is at the point where validation error is minimum (the U-shaped curve).

Inductive Bias in KNN

Inductive bias refers to the assumptions a learning algorithm uses to generalize beyond the training data. For KNN, the inductive bias is: "similar inputs have similar outputs" — i.e., nearby points in feature space are likely to belong to the same class.