Ever wondered what the K-NN Algorithm is? If so, keep reading to find out!
First, let's go over the two main types of machine learning algorithms:
- Supervised: This type of machine learning uses labeled datasets to train models to classify data and predict outcomes accurately. It is useful for predicting outputs for new, unseen data on the basis of prior examples. Some practical examples include classifying images, words, or file types. Supervised machine learning is useful when you know what you're looking for in a given dataset.
- Unsupervised: This type of machine learning trains models on untagged, unlabeled data. The model analyzes the data to find previously unknown patterns. It can also help discover features (independent variables that serve as input to a system), which is very useful for categorization. Some practical examples include data exploration, social media recommendation systems, and targeted marketing campaigns. Unsupervised machine learning is useful when you don't know what you're looking for in a given dataset.
Basic Overview of K-Nearest Neighbors algorithms
The K-Nearest Neighbors (K-NN) algorithm is considered one of the simplest machine learning algorithms, and it falls into the supervised category.
K-NN operates on the basic assumption that similar things exist in close proximity to each other, while dissimilar things lie farther apart. With this assumption, the K-NN algorithm places new, unseen data into the category it is most similar to among the available categories.
The K-NN algorithm is considered a lazy learning algorithm because it doesn't immediately learn from the training data assigned to it. Instead, it merely stores the training data, and only when it needs to classify new, unseen data does it place that data into the most similar category.
The K-NN algorithm is known as an instance-based classifier that classifies test data based on distance metrics (such as Euclidean, Chebyshev, Cosine, etc.).
K-NN can be used for both types of supervised machine learning problems: regression and classification. Despite this flexibility, it is mostly used for classification problems.
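To make this concrete, here is a minimal classification sketch using scikit-learn's `KNeighborsClassifier` on the built-in iris dataset (the dataset choice and the split/seed parameters are my own for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labeled dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# "Training" a K-NN model mostly just stores the data (lazy learning)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Score the model on the held-out test data
print(round(model.score(X_test, y_test), 2))
```

Note how cheap `fit` is compared to most other models: the real work happens at prediction time, when distances to the stored training points are computed.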
Here are some pros and cons of the K-Nearest Neighbors algorithm:
Pros:
- It is relatively easy and straightforward to set up
- It handles noisy training data pretty well, meaning it can still predict accurately when the training set contains noise. Noise in this context refers to inaccurate data resulting from human error during the data collection process
- It makes effective use of large training sets in the prediction process
Cons:
- The value of K always needs to be chosen prior to making any predictions, which can get quite tedious at times
- Since the distance from each new point to every training data point has to be calculated, the computational cost can get quite high
Basic flow of KNN
The following list of steps provides a general overview of how the KNN algorithm works:
- Load and split the data into train and test sets
- Pick a value for K. See the section below for a more in-depth explanation of how to do so
- To generate a prediction for a test data point, perform the following steps:
  - Calculate the distance between the test data point and each row of training data. The distance metric can be any of the following: Euclidean, Chebyshev, Cosine, Manhattan, Minkowski, Jaccard, etc.
  - Sort the calculated distances in ascending order
  - Take the first k rows from the sorted list. Ex: if k = 4, take the top 4 rows
  - Find the most frequent category among the selected rows
  - Return that category as the predicted category for the test data point
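The steps above can be sketched from scratch in a few lines of NumPy. This is a bare-bones illustration using Euclidean distance and a majority vote (the function name and toy data are my own):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from the new point to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Steps 2-3: sort ascending and take the indices of the k smallest
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: return the most frequent label among those neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny toy dataset: two well-separated clusters labeled 0 and 1
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # → 0
print(knn_predict(X_train, y_train, np.array([8.5, 8.5]), k=3))  # → 1
```

Swapping in another metric (say, Manhattan) only requires changing the one distance line, which is part of what makes K-NN so easy to experiment with.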
Picking a value for K
The value of K in the K-NN algorithm indicates the number of nearest neighbors to consider when making a prediction.
The value of K should optimally be a number that is not too small and not too large. Choosing a value of K that is too small leads to unstable decision boundaries and overfitting problems, while a value of K that is too large can lead to underfitting problems. It is also important to note that K-NN is a non-parametric algorithm, meaning it does not rely on any assumptions about the shape or parameters of the distribution from which the training data was drawn.
First of all, keep in mind that there is no definite way to pick the best value for K. The most practical and widely used strategy is trial and error. To do this, you pretend that part of your training data is "unknown" (i.e., you hold it out), then check how accurately each candidate value of K predicts that held-out data.
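One common way to automate this trial-and-error loop is cross-validation, where each fold of the data takes a turn as the "unknown" part. A sketch using scikit-learn (dataset choice and the candidate range are my own):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several candidate values of k and record the mean
# cross-validated accuracy for each one
scores = {}
for k in range(1, 16, 2):  # odd values only, to avoid voting ties
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

# Keep the k with the best held-out accuracy
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On a real problem you would run this on your training split only, keeping a final test set untouched for the last evaluation.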
Alternatively, you can use the following general rule of thumb for choosing a value of k:
k = sqrt(N)/2
where N is the number of rows in the training set.
It also helps to keep the value of k odd, so that the majority vote between two classes/categories can't end in a stalemate/tie.
Thanks for reading this blog post!
If you have any questions or concerns, please feel free to leave a comment on this post and I will get back to you when I find the time.