Machine Learning
Supervised Learning: Classification / K-Nearest Neighbour (KNN)
3. K-Nearest Neighbour (KNN) Algorithm
KNN is a non-parametric, instance-based, lazy learning algorithm. It does not build an explicit model during training; instead, it memorizes the training dataset and makes predictions at test time by looking at the K closest data points.
Working of KNN
- Load the training data and choose the value of K.
- For a new test point, calculate the distance to all training points using a distance metric (e.g., Euclidean).
- Select the K nearest neighbours (smallest distances).
- For classification: assign the class that appears most frequently among the K neighbours (majority voting).
- For regression: compute the mean/average of K neighbours' target values.
Distance Metrics
To find the "nearest" neighbours, the algorithm relies on distance calculations. The two most common metrics are:
1. Euclidean Distance (L2 Norm)
The straight-line distance between two points. Best for continuous real-valued features.
2. Manhattan Distance (L1 Norm)
The distance between two points measured along axes at right angles (like a taxi navigating city blocks). Best for high-dimensional or categorical data.
Choosing the Value of K
The choice of K is critical and has a significant impact on model performance:
- Small K (e.g., K=1): Model is very sensitive to noise (overfitting). High variance.
- Large K: Model is too smooth and may miss local patterns (underfitting). High bias.
Best Practices:
- Rule of thumb: K = √N where N is the number of training samples.
- Use cross-validation to select the optimal K.
- Choose an odd K for binary classification to avoid ties in majority voting.
Validation Error in KNN
Validation error measures how well the model generalizes to unseen data. As K increases:
- Training error increases (model becomes less flexible).
- Validation error first decreases (underfitting reduced), then increases (bias increases).
- Optimal K is at the point where validation error is minimum (the U-shaped curve).
Inductive Bias in KNN
Inductive bias refers to the assumptions a learning algorithm uses to generalize beyond the training data. For KNN, the inductive bias is: "similar inputs have similar outputs" — i.e., nearby points in feature space are likely to belong to the same class.