Machine Learning

Supervised Learning: Classification / K-Nearest Neighbour (KNN)

3. K-Nearest Neighbour (KNN) Algorithm

KNN is a non-parametric, instance-based, lazy learning algorithm. It does not build an explicit model during training; instead, it memorizes the training dataset and makes predictions at test time by looking at the K closest data points.

Working of KNN

Load the training data and choose the value of K.
For a new test point, calculate the distance to all training points using a distance metric (e.g., Euclidean).
Select the K nearest neighbours (smallest distances).
For classification: assign the class that appears most frequently among the K neighbours (majority voting).
For regression: compute the mean/average of K neighbours' target values.

Distance Metrics

To find the "nearest" neighbours, the algorithm relies on distance calculations. The two most common metrics are:

1. Euclidean Distance (L2 Norm)

The straight-line distance between two points. Best for continuous real-valued features.

d(x, y) = √[ Σ(xᵢ - yᵢ)² ]

2. Manhattan Distance (L1 Norm)

The distance between two points measured along axes at right angles (like a taxi navigating city blocks). Best for high-dimensional or categorical data.

d(x, y) = Σ |xᵢ - yᵢ|

Choosing the Value of K

The choice of K is critical and has a significant impact on model performance:

Small K (e.g., K=1): Model is very sensitive to noise (overfitting). High variance.
Large K: Model is too smooth and may miss local patterns (underfitting). High bias.

Best Practices:

Rule of thumb: K = √N where N is the number of training samples.
Use cross-validation to select the optimal K.
Choose an odd K for binary classification to avoid ties in majority voting.

Validation Error in KNN

Validation error measures how well the model generalizes to unseen data. As K increases:

Training error increases (model becomes less flexible).
Validation error first decreases (underfitting reduced), then increases (bias increases).
Optimal K is at the point where validation error is minimum (the U-shaped curve).

Inductive Bias in KNN

Inductive bias refers to the assumptions a learning algorithm uses to generalize beyond the training data. For KNN, the inductive bias is: "similar inputs have similar outputs" — i.e., nearby points in feature space are likely to belong to the same class.

Logistic Regression Naive Bayes