Implementation of K nearest neighbors:

Suppose we have a dataset of items whose features take numeric values. If there are n features, each item can be represented as a point in an n-dimensional space. Given a new item, we can measure its distance to every item in the dataset, pick the K closest neighbors, and see which class most of those neighbors belong to. The new item is then classified into that class. That is what we will implement here in Python for data science.

Now we deal with the problem of calculating the distance between items. The right choice depends on the dataset: for real-valued features we generally use the Euclidean distance, while for categorical or binary features we use the Hamming distance.
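As a quick illustration, a minimal Hamming distance helper could look like the sketch below (the name HammingDistance and the list-of-values representation are assumptions for illustration, not part of the tutorial's own code):

def HammingDistance(x, y):
    # Count the positions at which two equally long categorical/binary
    # feature vectors disagree
    return sum(1 for a, b in zip(x, y) if a != b)

For example, HammingDistance(['red', 'small'], ['red', 'large']) is 1, because the vectors differ in one position.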

Algorithm:

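In outline, the approach is:

1. Read the dataset into a list of items, one dictionary per item (ReadData).
2. For the item to be classified, compute its distance to every item in the training set.
3. Keep the k items with the shortest distances (UpdateNeighbors).
4. Count how often each class appears among those k neighbors (CalculateNeighborsClass).
5. Return the class that appears most often (FindMax).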

Reading the data:

Our input file will be in the following format:

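For illustration, assuming a small dataset with two numeric features (the feature names and values below are made up; your file will have its own), the file would look like this:

Height,Weight,Class
1.70,65,A
1.62,50,B
1.75,80,A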

Here, each item is one line: the value in the Class column is the item's classification, and the value under each feature name is that item's value for that particular feature. All values and feature names are separated by commas.

So, we place these data files, data and data2, in the working directory. We choose one of them and paste its contents into the text file named data that our script will read.

We read from the file and then split the input by lines:

The first line of the file holds the feature names, with the Class keyword at the end. We store the feature names in a list.

Next we move on to the dataset itself. The items are stored in a list named items, whose elements are dictionaries. The keys of each item dictionary are the feature names, plus Class to hold the item's class. Finally, we shuffle the items in the list.

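A minimal sketch of ReadData along these lines (the float conversion of feature values and the exact file handling are assumptions) might look like this:

import random

def ReadData(fileName):
    # Read from the file and split the input by lines
    with open(fileName, 'r') as f:
        lines = f.read().splitlines()

    # The first line holds the feature names; the Class keyword is at the end
    features = lines[0].split(',')[:-1]

    items = []
    for line in lines[1:]:
        values = line.split(',')
        # Each item is a dictionary keyed by feature name, plus 'Class'
        item = {'Class': values[-1]}
        for j, name in enumerate(features):
            item[name] = float(values[j])  # assumes numeric feature values
        items.append(item)

    # Shuffle so the order of classes in the file does not bias later splits
    random.shuffle(items)
    return items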

Classification of the data:

We have stored our data in items, and now we build the classifier. For this we write a new function called Classify, which takes as input the item we want to classify, the items list, and k, the number of closest neighbors to consider.

If k is greater than the length of the dataset, we do not go ahead with the classification, because it is not possible to have more closest neighbors than there are items in the dataset.

Next we calculate the distance between the item to be classified and every item in the training set, keeping only the k shortest distances. The current closest neighbors are held in a list called neighbors; each element holds two values, the distance to the item being classified and the class the neighbor belongs to. The distance is calculated using the generalized Euclidean formula. Finally, we pick the class that appears most often among the neighbors, and that is our prediction.

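A sketch of Classify that follows this description (the return value for the k-too-large case and the exact helper signatures are assumptions) could be:

def Classify(nItem, k, items):
    # We cannot have more closest neighbors than items in the dataset
    if k > len(items):
        return 'k larger than list length'

    # neighbors holds [distance, class] pairs, kept sorted by distance
    neighbors = []
    for item in items:
        # Distance between the item to classify and a training item
        distance = EuclideanDistance(nItem, item)
        # Keep only the k shortest distances seen so far
        neighbors = UpdateNeighbors(neighbors, item, distance, k)

    # Count the classes of the k nearest neighbors and pick the most frequent
    count = CalculateNeighborsClass(neighbors, k)
    return FindMax(count)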

We still need to implement EuclideanDistance, UpdateNeighbors, CalculateNeighborsClass, and FindMax.

Finding Euclidean Distance:

The generalized Euclidean formula for two vectors x and y is as follows:

d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
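In code, a sketch of EuclideanDistance over the item dictionaries (skipping the Class key is an assumption about how items are represented) might be:

import math

def EuclideanDistance(x, y):
    # Sum of squared differences over every feature, ignoring the class label
    s = 0
    for key in x:
        if key != 'Class':
            s += math.pow(x[key] - y[key], 2)
    return math.sqrt(s)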

Updating Neighbors:

Suppose we want to add an item to our neighbors list with a given distance. First we check whether neighbors already has a length of k. If it has fewer elements, we add the item regardless of its distance. Otherwise, we check whether the item has a shorter distance than the maximum-distance item in the list; if it does, we replace that maximum-distance item with the new one.

To find the maximum-distance item quickly, we keep the list sorted in ascending order, so the last item in the list always has the maximum distance. We replace it with the new item and sort again.

To speed this up, we can use an insertion-sort style update, inserting the new item into its correct position without having to sort the entire list.

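A sketch of UpdateNeighbors along these lines, using bisect.insort for the insertion-style update (using bisect here is an assumption; a hand-written insertion works just as well), is:

from bisect import insort

def UpdateNeighbors(neighbors, item, distance, k):
    entry = [distance, item['Class']]
    if len(neighbors) < k:
        # Fewer than k neighbors so far: add regardless of distance,
        # inserting at the sorted position (ascending by distance)
        insort(neighbors, entry)
    elif distance < neighbors[-1][0]:
        # Closer than the current farthest neighbor: drop the last
        # (maximum-distance) entry and insert the new one in order
        neighbors.pop()
        insort(neighbors, entry)
    return neighbors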

CalculateNeighborsClass:

Here we calculate the class that appears most often among the neighbors. For this we use a dictionary called count, whose keys are the class names appearing in neighbors. If a key does not exist yet, we add it; otherwise we increment its value.

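A sketch of CalculateNeighborsClass (assuming each neighbors entry is a [distance, class] pair, as above) could be:

def CalculateNeighborsClass(neighbors, k):
    count = {}
    for i in range(k):
        # neighbors[i][1] is the class of the i-th closest neighbor
        cls = neighbors[i][1]
        if cls not in count:
            # First time we see this class: add the key
            count[cls] = 1
        else:
            # Otherwise increment its value
            count[cls] += 1
    return count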

FindMax:

This function takes as input the count dictionary we built in CalculateNeighborsClass and returns the class with the maximum count.
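A sketch of FindMax as a simple linear scan (using max(count, key=count.get) would work just as well) is:

def FindMax(countDict):
    # Track the class seen with the highest count so far
    maximum = -1
    classification = ''
    for key in countDict:
        if countDict[key] > maximum:
            maximum = countDict[key]
            classification = key
    return classification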

Conclusion:

With this last step the kNN tutorial is complete. Now let us look at an example to make things clearer.

The code for this will be:
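Since the original example's driver code is not reproduced here, the following is only a minimal sketch: the file name data.txt, the 75/25 train-test split, and k = 3 are assumptions, so the accuracy you get may differ from the output shown below.

def main():
    items = ReadData('data.txt')

    # Hold out a quarter of the shuffled items for testing (assumed split)
    split = int(0.75 * len(items))
    trainSet, testSet = items[:split], items[split:]

    correct = 0
    for item in testSet:
        predicted = Classify(item, 3, trainSet)  # k = 3 is an assumption
        if predicted == item['Class']:
            correct += 1

    # Fraction of test items classified correctly
    print(correct / len(testSet))

if __name__ == '__main__':
    main()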

OUTPUT:

0.9375

