# K-Means Clustering – Introduction:

Suppose we have a set of items, each described by values for certain features. We want to sort these items into groups using Python, as is common in data science. We will use the k-means algorithm for this task. It is an unsupervised learning algorithm.

###### Overview:

The algorithm is easier to understand if we think of the items as points in n-dimensional space. It partitions the items into k groups by similarity. To measure similarity, we will use Euclidean distance.

The algorithm works in the following manner:

1. First, we initialize k points, called means, at random.
2. Then we assign each item to its closest mean and update that mean’s coordinates, which are the averages of the items assigned to it so far.
3. The process is repeated for a given number of iterations, after which we have the clusters.

The points mentioned above are called means because each one holds the mean values of the items assigned to it. There are many options for initializing these means. One intuitive method is to initialize the means at random items in the data set. Another is to initialize the means at random values between the boundaries of the data set.
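The first option, starting the means at randomly chosen items, could be sketched as below. The function name `init_means_from_items` and the `seed` parameter are assumptions made for illustration, and `items` is assumed to be a list of equal-length numeric lists.

```python
import random

def init_means_from_items(items, k, seed=None):
    """Pick k distinct items from the data set to serve as the initial means."""
    rng = random.Random(seed)
    # Copy each chosen item so later updates to a mean do not modify the data.
    return [list(item) for item in rng.sample(items, k)]

items = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]]
means = init_means_from_items(items, k=2, seed=0)
```

Starting from real items guarantees that every initial mean lies where data actually exists, at the cost of depending on which items happen to be drawn.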

The following sections go through the algorithm step by step. The input is read from a text file (data.txt). Each line represents one item and contains numerical values separated by commas.
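A file in this format could be read with a sketch like the one below. The function name `read_data` is an assumption, and the demo writes a small temporary data.txt so that the example runs on its own.

```python
import os
import tempfile

def read_data(file_name):
    """Read comma-separated numeric lines into a list of feature lists."""
    items = []
    with open(file_name, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            items.append([float(value) for value in line.split(",")])
    return items

# Demo with a throwaway file that follows the format described above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w") as f:
        f.write("5.1,3.5,1.4\n4.9,3.0,1.4\n")
    items = read_data(path)
```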

The data is read from the file and saved into a list. Each element of the list is itself a list that contains the item’s values for the features.

###### Initialize means:
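Initializing the means between the boundaries of the data set could look like the sketch below. The function names `find_col_min_max` and `initialize_means` are illustrative assumptions; `minima` and `maxima` follow the naming used in this section.

```python
import random

def find_col_min_max(items):
    """Return the per-feature minima and maxima over all items."""
    columns = list(zip(*items))  # one tuple of values per feature
    minima = [min(col) for col in columns]
    maxima = [max(col) for col in columns]
    return minima, maxima

def initialize_means(k, minima, maxima):
    """Create k means, drawing each feature uniformly between its min and max."""
    return [
        [random.uniform(lo, hi) for lo, hi in zip(minima, maxima)]
        for _ in range(k)
    ]

items = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]]
minima, maxima = find_col_min_max(items)
means = initialize_means(3, minima, maxima)
```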

We need to initialize each mean’s values within the range of the items’ feature values. For this, we need to find the minimum and maximum of each feature, which a small helper function can do. The variables minima and maxima are lists that hold the minimum and maximum value of each feature, respectively. We initialize each mean’s feature values randomly between the corresponding minimum and maximum in the minima and maxima lists.

###### Euclidean distance:
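For two items with the same number of features, Euclidean distance is the square root of the sum of the squared feature differences. A minimal sketch, with the function name chosen here for illustration:

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length feature lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```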

Euclidean distance serves as the similarity metric for the data set. Depending on the items in our data set, we can also use a different similarity metric.

###### Update means:
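An incremental update, which folds one new item into a cluster’s mean without re-adding the earlier values, could be sketched as follows. The name `update_mean` and its argument order are assumptions.

```python
def update_mean(n, mean, item):
    """Return the mean of n items, given the mean of the first n - 1 and the nth item."""
    return [(m * (n - 1) + x) / n for m, x in zip(mean, item)]

# The mean of two items is [2.0, 4.0]; a third item [5.0, 1.0] joins the cluster.
new_mean = update_mean(3, [2.0, 4.0], [5.0, 1.0])
```

With `n = 1` the old mean is multiplied by zero, so the first item assigned to a mean simply replaces its initial value.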

To update a mean, we need the average value of each feature over all the items in that mean’s cluster. We can compute this by adding up all the values and dividing by the number of values. We can also calculate the new average without re-adding all the values: if m is the mean of the first n - 1 values and x is the nth value, the new mean is (m(n - 1) + x) / n.

###### Classification of items:
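Assigning an item to its closest mean could be sketched as below, assuming Euclidean distance as the similarity measure. The function names are illustrative, and the distance helper is repeated so that the block runs on its own.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def classify(means, item):
    """Return the index of the mean closest to item."""
    best_index, best_distance = -1, math.inf
    for index, mean in enumerate(means):
        distance = euclidean_distance(item, mean)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index

means = [[0.0, 0.0], [10.0, 10.0]]
near_first = classify(means, [1.0, 2.0])
near_second = classify(means, [9.0, 8.0])
```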

We now need a function for classifying an item into a group or cluster. For a given item, we compute its similarity to each mean and assign the item to the closest one.

###### Finding means:

To find the means, we loop through all the items, assign each one to its nearest cluster, and then update that cluster’s mean. The process is repeated for a given number of iterations. If no item changes classification between two iterations, the process stops because the classification has converged. This is a local optimum and not necessarily the best possible grouping.
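The loop described above could be sketched as follows. This is a simplified illustration under several assumptions: the function names are chosen here, `belongs_to` and `cluster_sizes` stand in for belongsTo and clusterSizes, and the optional `means` argument is an addition that allows a reproducible start. In this sketch the counts in `cluster_sizes` keep growing across passes, so each mean is a running average over every assignment made to it so far.

```python
import math
import random

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def classify(means, item):
    # Index of the mean closest to item.
    return min(range(len(means)), key=lambda i: euclidean_distance(item, means[i]))

def update_mean(n, mean, item):
    # Running average after the nth assignment to this mean.
    return [(m * (n - 1) + x) / n for m, x in zip(mean, item)]

def calculate_means(k, items, max_iterations=100, means=None):
    if means is None:
        # Random start between each feature's minimum and maximum.
        columns = list(zip(*items))
        means = [[random.uniform(min(c), max(c)) for c in columns] for _ in range(k)]
    else:
        means = [list(m) for m in means]  # copy so the caller's list is untouched

    cluster_sizes = [0] * k         # assignments made to each mean so far
    belongs_to = [-1] * len(items)  # current cluster index of each item

    for _ in range(max_iterations):
        no_change = True
        for i, item in enumerate(items):
            index = classify(means, item)
            cluster_sizes[index] += 1
            means[index] = update_mean(cluster_sizes[index], means[index], item)
            if index != belongs_to[i]:
                no_change = False
                belongs_to[i] = index
        if no_change:
            break  # no item moved during this pass
    return means, belongs_to

items = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
means, belongs_to = calculate_means(
    2, items, max_iterations=50, means=[[0.0, 0.0], [10.0, 10.0]]
)
```

A batch variant would instead recompute each mean from its current members at the end of every pass; the running average shown here follows the incremental update described in this article.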

This function takes as input k (the number of clusters required), the items, and the maximum number of iterations. It returns the means and the clusters. The classification of each item is stored in the array belongsTo, and the number of items in each cluster is stored in clusterSizes.

###### Finding clusters:
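Grouping the items by their closest mean could be sketched as below. The function name `find_clusters` is an assumption, and the helpers are repeated so that the block runs on its own.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def classify(means, item):
    # Index of the mean closest to item.
    return min(range(len(means)), key=lambda i: euclidean_distance(item, means[i]))

def find_clusters(means, items):
    """Return one list of items per mean, in the same order as means."""
    clusters = [[] for _ in means]
    for item in items:
        clusters[classify(means, item)].append(item)
    return clusters

means = [[1.0, 1.0], [8.0, 8.0]]
items = [[1.0, 1.0], [8.0, 8.0], [1.2, 0.8], [7.9, 8.1]]
clusters = find_clusters(means, items)
```

A mean that attracts no items ends up with an empty list, which keeps the cluster indices aligned with the means.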

Now we want to find the clusters once the means are given. We iterate through all the items and assign each one to its closest cluster.

Euclidean distance is not the only option; other popular similarity measures can be used in the same way.
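As an illustration of alternative measures, two that are often used in place of Euclidean distance are Manhattan distance and cosine distance. These are examples chosen here, with assumed function names; the cosine version assumes neither vector is all zeros.

```python
import math

def manhattan_distance(x, y):
    """Sum of absolute feature differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y (non-zero vectors only)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

city_block = manhattan_distance([0.0, 0.0], [3.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Either function can be passed in wherever the distance between an item and a mean is computed, since both take two equal-length feature lists and return a number that grows as the items become less similar.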
