In pattern recognition, k-nearest neighbors (k-NN) is a classification algorithm used to classify examples based on a set of already classified examples. A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common among its k nearest neighbors as measured by a distance function. If k = 1, then the case is simply assigned to the class of its nearest neighbor.
The idea of the algorithm is to use the known features of an example to determine its unknown classification.
First, a set of already classified samples, the training set, is supplied to the algorithm. When a new, unclassified sample is given, the algorithm finds the k nearest neighbors of the new sample among the training samples and determines its classification from the classes of those neighbors.
The algorithm is sometimes called a lazy classifier because it does nothing during "learning" except store the samples; all the work is done during classification.
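As a concrete illustration, the following sketch uses scikit-learn's KNeighborsClassifier (assuming Python with scikit-learn installed; the toy data points are invented for the example). Note that fit only stores the samples, while the neighbor search and vote happen at prediction time:

    from sklearn.neighbors import KNeighborsClassifier

    # Toy training set: two features per sample, two classes (0 and 1).
    X_train = [[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.1, 8.7]]
    y_train = [0, 0, 1, 1]

    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)   # "training" just stores the samples

    # All the real work (finding the 3 nearest neighbors and voting)
    # happens here, at classification time.
    print(clf.predict([[8.5, 9.2]]))  # -> [1]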
Algorithm
The k-NN algorithm is among the simplest of all machine learning algorithms. A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data.
The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.
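Both phases can be written out directly. The sketch below is plain Python; the function name knn_classify and the data are ours, not a standard API. The training data is simply stored as a list, and a query point receives the majority label of its k nearest stored samples under Euclidean distance:

    import math
    from collections import Counter

    def euclidean(a, b):
        # Straight-line distance between two feature vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(train, query, k):
        # train is a list of (feature_vector, label) pairs: the stored
        # "model". Sort by distance to the query and keep the k nearest.
        neighbors = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
        # Majority vote over the neighbors' labels.
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"),
             ([8.0, 9.0], "B"), ([9.1, 8.7], "B")]
    print(knn_classify(train, [1.1, 0.9], k=3))  # -> "A"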
A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, as in text classification, another metric can be used, such as the overlap metric (or Hamming distance).
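A minimal sketch of the two metrics (plain Python, illustrative values only):

    def euclidean(a, b):
        # For continuous features: square root of summed squared differences.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def overlap(a, b):
        # For discrete features (Hamming distance): count of positions
        # where the two vectors disagree.
        return sum(x != y for x, y in zip(a, b))

    print(euclidean([0.0, 3.0], [4.0, 0.0]))          # -> 5.0
    print(overlap(["red", "big"], ["red", "small"]))  # -> 1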
Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor (LMNN) or Neighbourhood Components Analysis (NCA).
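As an illustration, scikit-learn ships an implementation of NCA; a sketch of learning the metric before running k-NN might look like the following (the dataset and parameter choices are illustrative, not prescriptive):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import (KNeighborsClassifier,
                                   NeighborhoodComponentsAnalysis)
    from sklearn.pipeline import Pipeline

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0)

    # Learn a linear transformation of the features (NCA), then run
    # plain k-NN in the transformed space.
    model = Pipeline([
        ("nca", NeighborhoodComponentsAnalysis(random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ])
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))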