K-Nearest Neighbor (kNN)

What Is K-Nearest Neighbor (kNN)?

The k-nearest neighbors (kNN) algorithm is a widely used machine learning method for classification and regression problems. The algorithm operates on a simple principle: It uses the 'k' nearest data points in the training dataset to predict the output for a new, previously unseen instance. The 'k' in kNN stands for the number of nearest neighbors the algorithm considers when making its prediction.

History and Background of K-Nearest Neighbor

The roots of the kNN algorithm trace back to statistical decision theory developed in the early 20th century. However, the formal introduction of kNN as a machine learning algorithm is attributed to Evelyn Fix and Joseph Hodges, who presented it in 1951. With the growth of computer processing power and data availability, kNN has been increasingly used due to its simplicity, ease of understanding, and robust performance across a variety of tasks.

 

At the heart of the kNN algorithm lies a simple premise: instances that are close to one another in feature space tend to be similar. For a classification problem, the algorithm predicts the class of a new instance as the majority class among its 'k' nearest neighbors. For a regression problem, it predicts the output as the average (or weighted average) of the target values of the 'k' nearest neighbors.
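
As a minimal illustration of the two prediction rules, the sketch below assumes the labels (or target values) of the 'k' nearest neighbors have already been retrieved for a query point; the data shown are purely illustrative.

from collections import Counter
import numpy as np

# Hypothetical labels / target values of the k nearest neighbors of a query point.
neighbor_classes = ["spam", "ham", "spam"]      # classification case (k = 3)
neighbor_targets = np.array([3.1, 2.8, 3.4])    # regression case (k = 3)

# Classification: majority vote among the k neighbors.
predicted_class = Counter(neighbor_classes).most_common(1)[0][0]   # 'spam'

# Regression: (unweighted) average of the neighbors' target values.
predicted_value = neighbor_targets.mean()                          # ~3.1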

 

The concept of 'nearest' involves a distance metric, such as the Euclidean or Manhattan distance. The choice of distance metric can significantly influence the performance of the algorithm, as can the choice of 'k'.
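
Because the metric defines what 'nearest' means, changing it can change which neighbor is selected. The short NumPy sketch below, with made-up points, illustrates this for the Euclidean and Manhattan distances.

import numpy as np

def euclidean(a, b):
    # Straight-line (L2) distance.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # City-block (L1) distance.
    return np.sum(np.abs(a - b))

query = np.array([0.0, 0.0])
p1 = np.array([3.0, 3.0])
p2 = np.array([0.0, 4.5])

print(euclidean(query, p1), euclidean(query, p2))  # ~4.24 vs 4.5 -> p1 is nearer
print(manhattan(query, p1), manhattan(query, p2))  # 6.0 vs 4.5   -> p2 is nearer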

Algorithm Description

The kNN algorithm operates in the following steps:

 

  • Calculate the distance from the new instance to all instances in the training set.

  • Sort the distances and determine the 'k' nearest instances.

  • If kNN is being used for classification, assign the class that occurs most frequently among the 'k' nearest instances to the new instance.

  • If kNN is used for regression, assign the average (or weighted average) of the target values of the 'k' nearest instances to the new instance.

 

The algorithm does not learn an explicit model during a training phase; instead, it stores the training data and consults it directly at prediction time. Because of this, kNN is referred to as an instance-based or memory-based algorithm.
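
To make these steps concrete, here is a minimal from-scratch sketch of a kNN classifier in NumPy. The function name, toy data, and choice of Euclidean distance are illustrative, not part of any particular library.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: distances from the new instance to every training instance (Euclidean).
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the labels of the k nearest instances.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative toy data: two well-separated 2-D classes.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 5.5], [7.0, 6.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.2, 1.8])))  # -> 0
print(knn_predict(X_train, y_train, np.array([6.3, 6.1])))  # -> 1

Note that there is no fitting step here: the 'model' is simply the stored training data, which is what makes kNN instance-based.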

Hyperparameters in kNN

Two key hyperparameters in the kNN algorithm are 'k' (the number of neighbors) and the distance metric used.

 

The choice of 'k' controls the bias-variance trade-off in the algorithm's predictions. A small 'k' gives a flexible fit (low bias, high variance) but is more sensitive to noise in the training data. Conversely, a large 'k' gives a smoother, less variable fit (high bias, low variance) but can oversmooth the decision boundary and bias predictions towards the majority class.
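
One common way to examine this trade-off is to compare cross-validated accuracy for several values of 'k'. The sketch below does this with scikit-learn, assuming it is installed; the Iris dataset and the particular values of 'k' are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Compare a very small, a moderate, and a large number of neighbors.
for k in (1, 15, 75):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"k={k:>2}: mean cross-validated accuracy = {scores.mean():.3f}")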

 

Different distance metrics capture different notions of similarity, depending on the problem at hand. Common choices include the Euclidean, Manhattan, and Minkowski distances.
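
The Minkowski distance of order p generalizes the other two: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. A small NumPy sketch, with made-up vectors:

import numpy as np

def minkowski(a, b, p):
    # Minkowski distance of order p: (sum |a_i - b_i|^p)^(1/p).
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(minkowski(a, b, p=1))  # Manhattan distance: 5.0
print(minkowski(a, b, p=2))  # Euclidean distance: sqrt(13) ≈ 3.61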

Advantages and Disadvantages of kNN

The advantages of kNN include its simplicity, intuitiveness, and ease of implementation. It makes no assumptions about the underlying data distribution, which makes it a highly flexible, non-parametric method.

 

However, kNN has its drawbacks. It can be computationally expensive for large datasets, as it requires calculating distances to all training instances for each prediction. It is sensitive to the scale of the features, so feature scaling is often necessary. It also suffers from the "curse of dimensionality", performing poorly when the number of features is large.
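
Feature scaling is usually applied before kNN so that no single feature dominates the distance calculation. Below is a minimal sketch using scikit-learn (assuming it is installed); the Wine dataset is used only because its features have very different scales.

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# kNN on raw features vs. kNN with standardization applied in a pipeline.
unscaled = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("without scaling:", cross_val_score(unscaled, X, y, cv=5).mean())
print("with scaling:   ", cross_val_score(scaled, X, y, cv=5).mean())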

Variations and Related Algorithms

Variations of the kNN algorithm include weighted kNN, where closer instances are given more weight in the decision, and radius-based neighbor learning, where all instances within a fixed radius are considered instead of a fixed number 'k' of neighbors.
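
Both variations are available in scikit-learn, for example as sketched below (assuming scikit-learn is installed; the radius value is illustrative and depends on the scale of the data).

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Weighted kNN: closer neighbors get a larger vote (weights are the inverse of distance).
weighted_knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)

# Radius-based neighbors: every training point within the radius votes, however many there are.
radius_model = RadiusNeighborsClassifier(radius=1.0).fit(X, y)

print(weighted_knn.predict(X[:3]))
print(radius_model.predict(X[:3]))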

 

Related algorithms include the nearest centroid classifier and the nearest subspace classifier. These methods represent each class by a single point or a subspace, reducing computational costs but potentially sacrificing some accuracy.
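
As an illustration, scikit-learn provides a nearest centroid classifier (assuming scikit-learn is installed; there is no comparably standard implementation of the nearest subspace classifier, so it is not shown).

from sklearn.datasets import load_iris
from sklearn.neighbors import NearestCentroid

X, y = load_iris(return_X_y=True)

# Each class is summarized by the mean (centroid) of its training points;
# a query is assigned to the class with the closest centroid.
model = NearestCentroid().fit(X, y)
print(model.predict(X[:3]))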