Deep Dive on KNN: Understanding and Implementing the K-Nearest Neighbors Algorithm
Introduction
The K Nearest Neighbor (KNN) algorithm is a simple, non-parametric machine learning algorithm used for both classification and regression tasks. It is based on the assumption that similar items are close to each other in a feature space. KNN works by finding the k-nearest neighbors to a given query point, and predicting the class or value of the query point based on the classes or values of its neighbors.
This blog post aims to provide a technical overview of the KNN algorithm, and to explain its practical implementation, applications, and limitations. We will also cover some techniques to improve the performance of the KNN algorithm.
What is the K Nearest Neighbor Algorithm?
Definition of KNN
The KNN algorithm is a type of instance-based learning, or lazy learning. It involves storing all available cases and classifying new cases based on a similarity measure (e.g., distance functions). The basic idea of KNN is to find the k-nearest neighbors to a given query point and use their class labels (in the case of classification) or their values (in the case of regression) to make a prediction for the query point.
How It Works
KNN works in three main steps: (1) calculating the distance between the query point and each training point, (2) selecting the k-nearest neighbors to the query point, and (3) predicting the class or value of the query point based on the majority class or the mean value of the neighbors, respectively. The choice of the distance metric and the value of k are important parameters in the KNN algorithm.
There are many different distance metrics you can choose from, with the most common being Euclidean distance. Other possible distance metrics include Manhattan distance, Minkowski distance, Hamming distance, and Cosine distance. These distance metrics define the decision boundaries for the corresponding partitions of data. The “right” distance metric to choose depends on the problem at hand.
With respect to choosing the appropriate value of k, this will largely depend on the input dataset. The k value determines the number of data points that the algorithm considers when predicting the value/label of a new data point. A larger value of k considers more data points, resulting in a smoother decision boundary, but may lead to underfitting. A smaller value of k considers fewer data points, resulting in a more complex decision boundary and may lead to overfitting. We dive deeper into choosing the value of k in the section on implementation KNN algorithm below.
Applications
Image Classification
The KNN algorithm can be used in image classification tasks by representing each image as a feature vector and then applyig the KNN algorithm to the set of feature vectors.
One common approach is to use the pixel values of an image as the features. For example, an RGB image with dimensions 100×100 pixels can be represented as a feature vector of length 30,000 (100x100x3), where each element of the vector corresponds to the pixel value of the corresponding RGB channel.
Once the feature vectors are extracted from the images, the KNN algorithm can be used to classify new images by finding the k nearest neighbors of the new image’s feature vector in the training set and assigning the new image to the majority class among its neighbors.
However, using pixel values as features can result in high-dimensional feature vectors that are computationally expensive to process. To address this issue, techniques like Principal Component Analysis (PCA) and Convolutional Neural Networks (CNNs) can be used to reduce the dimensionality of the feature vectors and extract more meaningful features that can improve the accuracy of the KNN algorithm.
Sentiment Analysis
The KNN algorithm can be used in sentiment classification tasks by representing each text document as a feature vector, and then applying the KNN algorithm to the set of feature vectors. The process of transforming text (i.e., unstructured data) into vector representations is known as tokenization (check out our blog post on tokenization here to read more).
One common approach is to represent each document as a bag-of-words model, where each feature corresponds to a unique word in the vocabulary, and the value of each feature is the frequency of the corresponding word in the document. For example, consider the following two documents:
Document 1: “The movie was excellent and I highly recommend it.”
Document 2: “The movie was terrible and I do not recommend it.”
The vocabulary for these documents might be: [“the”, “movie”, “was”, “excellent”, “and”, “i”, “highly”, “recommend”, “terrible”, “do”, “not”]. The feature vectors for these documents would then be:
Document 1: [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Document 2: [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
Once the feature vectors are extracted from the text documents, the KNN algorithm can be used to classify new documents by finding the k nearest neighbors of the new document’s feature vector in the training set and assigning the new document to the majority class among its neighbors.
However, similarly to using pixel values as features for image classification, using the bag-of-words model as features can result in high-dimensional feature vectors that are computationally expensive to process. To address this issue, techniques like Term Frequency-Inverse Document Frequency (TF-IDF) weighting and word embeddings can be used to represent the documents as lower-dimensional feature vectors that can improve the accuracy of the KNN algorithm.
Implementation
Preprocessing Data
Before applying KNN, it is essential to preprocess the data to remove any noise, outliers, and missing values. This can involve techniques such as scaling or normalizing the data, imputing missing values, or reducing the dimensionality of the data.
Scaling the data is an important preprocessing step for the KNN algorithm because the algorithm is distance-based. If the data is not scaled, features with larger scales will dominate the distance calculation and can result in incorrect classifications. For example, consider a dataset with two features, one of which is age (ranging from 0 to 100) and the other is income (ranging from 0 to 1,000,000). If the data is not scaled, the distance calculation will be dominated by the income feature, and the age feature will have a much smaller impact on the classification decision. This can lead to inaccurate classifications, especially if the features have very different scales. Scaling the data ensures that some features don’t dominate others just because of the magnitude of the values.
Scaling the features can improve the accuracy of the KNN algorithm and ensure that it’s robust to changes in the scale of the data. Standardization is one common scaling technique that is often used for KNN, as it transforms the data to have a mean of 0 and a standard deviation of 1, which can help to mitigate the effects of differences in scale between features.
from sklearn.preprocessing import StandardScaler
# Create a StandardScaler object
scaler = StandardScaler()
# Fit the scaler to the feature set and transform it
X_scaled = scaler.fit_transform(X)
Choosing K Value
After scaling the data, we need to choose our hyperparameters (i.e., the value of k). Hyperparameters refer to parameters that we choose for the algorithm, or in other words, the parameters that aren’t learned during training. Choosing the right value of k is critical for the performance of the KNN algorithm. If k is too small, the algorithm may be overly sensitive to noise in the data, which can lead to overfitting. On the other hand, if k is too large, the algorithm may oversimplify the decision boundaries and fail to capture important patterns in the data, which can lead to underfitting.
The best value of k depends on the specific dataset and problem that we are working with. Generally, smaller values of k are better suited for datasets with a lot of noise or with complex decision boundaries, while larger values of k are better suited for datasets with smoother decision boundaries.
One common technique for choosing the optimal value of k is to use cross -validation, where we split the data into training and testing sets multiple times and evaluate the algorithm’s performance for different values of k. This allows us to choose the value of k that results in the best overall performance on the dataset.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# Perform cross-validation for different values of k
k_values = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
cv_scores = []
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X_scaled, y, cv=10, scoring='accuracy')
cv_scores.append(scores.mean())
# Find the optimal value of k
optimal_k = k_values[cv_scores.index(max(cv_scores))]
It’s also important to note that the choice of k is not the only hyperparameter that can affect the performance of the KNN algorithm. Other factors, such as the distance metric used to measure distances between data points, can also have a significant impact on the algorithm’s performance. Therefore, it’s important to experiment with different hyperparameters and evaluate their impact on the algorithm’s performance.
Measuring Performance
After choosing hyperparameters and training the model, we now need to evaluate our model’s performance. This comes in 2 steps – choosing a performance metric and evaluating the model on a new dataset with that performance metric.
The choice of performance metric largely depends on the problem that KNN is trying to solve. Some common classification metrics include accuracy, precision, recall, f1 score, ROC AUC, and PR AUC. Common regression metrics include MSE, RMSE, MAE, and R2. The right metric to choose also depends heavily on the dataset. For example, choosing accuracy as the evaluation metric for highly imbalanced datasets is often not as good of a metric as precision or recall. Take a dataset that has 98% of examples in the majority class, and the other 2% of examples belong to the other class. If our model just predicted the majority class each time, we would achieve roughly 98% accuracy, which is a false representation of how well our model can differentiate between the two classes.
Once the metric is chosen, there are several ways to evaluate the performance of the KNN algorithm, each with its own benefits and limitations. Two common approaches are cross -validation and training/testing split. Overall, both cross -validation and training/testing split are useful approaches for evaluating the performance of the KNN algorithm, and the choice between them depends on the specific problem and the available resources. Training/testing split is a quick and easy way to evaluate the performance of the algorithm, but it may suffer from overfitting or underfitting. Cross -validation provides a more reliable estimate of performance, but can be computationally expensive.
In our example above, we captured the average of the cross validation scores for each value of k that we tested. Thus, we already have our cross validation score saved in the cv_scores array.
# Get the best cv score from the cv_scores array
best_cv_score = cv_scores[cv_scores.index(max(cv_scores))]
Limitations of KNN Algorithm
Curse of Dimensionality
The curse of dimensionality is a term used to describe the challenge of working with high-dimensional data. In a high-dimensional space, the distance between points becomes less meaningful as the number of dimensions increases. This is because the volume of the space increases exponentially with the number of dimensions, which leads to a sparse distribution of data points. As a result, the nearest neighbors to a given point may be very far away, making it difficult to find a reliable set of neighbors to use in the KNN algorithm.
In addition, as the number of dimensions increases, the amount of data required to represent the space effectively also increases. This means that as the number of dimensions grows, the amount of data required to produce reliable results can become prohibitively large. As a result, the computational cost of the KNN algorithm can become very high, making it difficult to use in practical applications.
To address the curse of dimensionality, various techniques have been developed, such as feature selection, dimensionality reduction, and data preprocessing. These techniques can help to reduce the number of dimensions in the data, making it more manageable and easier to work with. Additionally, other algorithms that are less sensitive to high-dimensional data, such as decision trees, neural networks, or support vector machines, may be used as an alternative to KNN.
Imbalanced data
Imbalanced data can also affect the performance of the KNN algorithm by affecting the distance calculation between samples. Recall that KNN is a distance-based algorithm that computes the distances between samples to find the k-nearest neighbors. In imbalanced datasets, the samples of the minority class are often sparse and scattered, and the majority class dominates the feature space. As a result, the distances between the minority class samples may be higher, and the distances between the majority class samples may be lower, leading to misclassifications.
One solution is to use resampling techniques to balance the dataset. Resampling techniques can either oversample the minority class by creating synthetic samples or undersample the majority class by removing some samples. This way, the density of the minority class samples can be increased, and the distances between the minority class samples can be reduced, leading to better classification performance. However, it is important to be careful with resampling techniques as they can introduce bias and lead to overfitting. Therefore, it is recommended to use cross -validation or other evaluation metrics to validate the performance of the model.
Computationally Expensive
The computational complexity of the KNN algorithm mainly depends on the size of the dataset, the number of features, and the value of k. The time complexity of the KNN algorithm for a single query point is O(nd), where n is the number of training examples and d is the number of features. This is because for each query point, the algorithm needs to compute the distance between the query point and every other point in the dataset. As the number of features or the size of the dataset increases, the computational complexity of KNN also increases significantly, making it computationally expensive and impractical for large datasets. Additionally, finding the optimal value of k by brute force can also increase the time complexity of the algorithm.
Improving KNN Algorithm
Feature Selection
Feature selection refers to the process of selecting a subset of relevant features from a larger set of features in a dataset. The aim of feature selection is to remove irrelevant and redundant features, thereby improving the performance of the algorithm.
In the case of KNN, feature selection can be used to improve performance in several ways. One of the main benefits is that it can reduce the dimensionality of the dataset, which in turn can significantly reduce the computational complexity of the algorithm. This is particularly important in large datasets where the number of features can be very high.
By reducing the dimensionality of the dataset, feature selection can also help to remove noisy and irrelevant features, which can reduce overfitting and improve the accuracy of the algorithm. Furthermore, by reducing the number of features, feature selection can also help to improve the generalization of the algorithm, making it more robust to new data.
Overall, feature selection can be a powerful tool for improving the performance of KNN, particularly in cases where the number of features is high, and computational complexity is a concern.
Dimensionality Reduction Algorithms
Similarly to feature selection, dimensionality reduction algorithms can help with KNN performance by reducing the number of features or variables used in the distance calculation, thus reducing the computational complexity of the algorithm. These algorithms aim to find a lower-dimensional representation of the data that retains most of the information.
For example, Principal Component Analysis (PCA) is a commonly used dimensionality reduction technique that transforms the data into a lower-dimensional space by projecting the original data onto a smaller number of principal components, which are linear combinations of the original features. The principal components are chosen in such a way that they explain most of the variance in the data. By using PCA, one can reduce the number of features used in the distance calculation while retaining most of the information in the data.
Other dimensionality reduction algorithms include SNE, t-SNE, and UMAP. To read more about these algorithms, check out our blog post on dimensionality reduction.
Ensemble Methods
Ensemble methods can improve the performance of KNN by combining the predictions of multiple KNN models. One way to do this is by using a technique called bagging, where multiple KNN models are trained on different subsets of the data, and their predictions are combined through averaging or voting. Another way is by using boosting, where KNN models are sequentially trained on the data, with each subsequent model focused on correcting the errors of the previous ones. By combining multiple models, ensemble methods can improve the stability and robustness of KNN, and reduce the impact of noisy or irrelevant features in the data.
Conclusion
While KNN has several advantages such as ease of implementation, interpretability, and no assumptions about data distribution, it also has its limitations. The algorithm is sensitive to the choice of k, and can be affected by the curse of dimensionality, imbalanced data, and computational complexity issues. However, with appropriate preprocessing steps such as scaling, feature selection, and dimensionality reduction, as well as the use of ensemble methods, these limitations can be mitigated. Overall, KNN is a valuable tool in the machine learning toolbox and can be useful in a variety of real-world applications.