It is an informative analysis of data that takes many relationships into account. We will now look at some basic techniques for multidimensional data analysis using open source libraries in Python.
We will now look at code for reading 2D tabular data from zoo_data.csv.
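A minimal sketch of loading such a table with pandas follows. The contents of zoo_data.csv are not reproduced in this text, so the snippet writes a tiny stand-in file first; the column names (animal_name, hair, legs) and the three rows are illustrative assumptions, not the real data.

```python
import os
import tempfile

import pandas as pd

# Stand-in for zoo_data.csv: these columns and rows are made up for illustration.
sample_csv = "animal_name,hair,legs\naardvark,1,4\nbass,0,0\nchicken,0,2\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "zoo_data.csv")
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write(sample_csv)

    # read_csv parses the header row into column names and infers column dtypes.
    zoo = pd.read_csv(csv_path)

# Use the (assumed) name column as the row index so only numeric features remain.
features = zoo.set_index("animal_name")

print(features.shape)
print(features.head())
```

With the real file, the temporary-file step disappears and a single `pd.read_csv("zoo_data.csv")` call is enough.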
In the above example, we can group the data into clusters or subgroups using clustering techniques such as K-means, DBSCAN, and hierarchical clustering. (KNN, or K-nearest neighbors, is sometimes listed alongside these, but it is a supervised classification method rather than a clustering technique.) We will use K-means clustering in this case. K-means clustering is available through the KMeans class of the cluster module of the sklearn library.
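As a rough sketch of that call, the snippet below fits KMeans on synthetic data. The three 5-dimensional blobs and the choice of `n_clusters=3` are assumptions made only so the example runs on its own; they are not the settings used for the zoo data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the feature table: three blobs in 5 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [5.0] * 5, [-5.0] * 5])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 5)) for c in centers])

# n_clusters is chosen by the analyst; 3 matches the synthetic data here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one cluster index per row of X

print(labels[:10])
print(kmeans.cluster_centers_.shape)
print(kmeans.inertia_)
```

Fixing `random_state` makes the centroid initialization repeatable, which helps when comparing inertia values between runs.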
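In the above example, the returned cluster inertia is 119.70392382759556. This value is stored in the kmeans.inertia_ attribute. Inertia is the sum of squared distances from each sample to its assigned cluster center, so lower values indicate tighter clusters.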
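To make the meaning of the inertia value concrete, the following sketch recomputes it by hand and compares it with the `inertia_` attribute. The random data and `n_clusters=4` are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))  # arbitrary stand-in data

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Look up, for every sample, the centroid of the cluster it was assigned to.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]

# Inertia: sum of squared distances between samples and their centroids.
manual_inertia = float(((X - assigned_centers) ** 2).sum())

print(manual_inertia, kmeans.inertia_)
```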
To perform EDA, we have to reduce the dimensionality of the multivariate data so that it becomes trivariate or bivariate and can be plotted. This can be achieved with PCA (Principal Component Analysis).
PCA can be carried out with the PCA class of the decomposition module of the sklearn library in the following way.
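A minimal sketch of that reduction is shown here, assuming a stand-in matrix of 100 samples and 16 features; the shape is an assumption, not the dimensions of the zoo data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))  # stand-in: 100 samples, 16 features

# Keep the three directions of largest variance so the data can be drawn in 3D.
pca = PCA(n_components=3)
X_3d = pca.fit_transform(X)

print(X_3d.shape)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```

The `explained_variance_ratio_` values indicate how much of the original spread survives the projection, which is worth checking before reading too much into the 3D picture.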
A scatter plot is a 2D or 3D plot that helps in analyzing the clusters present in 2D or 3D data.
The 3D reduced data produced earlier can be drawn as a scatter plot in the following manner.
The code below generates an array of colors sorted by their hue, saturation, and value (HSV) components. Each color is associated with a single cluster and is used to mark the animals of that cluster as points in the 3D plot.
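One way to build such a color list, sketched with matplotlib's named CSS4 colors, is shown below. The helper `hsv_key` and the count of 7 clusters are hypothetical choices for illustration; the original listing may have used a different palette or count.

```python
import matplotlib.colors as mcolors


def hsv_key(name):
    """Return the (hue, saturation, value) tuple for a named CSS4 color."""
    rgb = mcolors.to_rgb(mcolors.CSS4_COLORS[name])
    return tuple(mcolors.rgb_to_hsv(rgb))


# Sort color names by hue first, then saturation, then value.
sorted_names = sorted(mcolors.CSS4_COLORS, key=hsv_key)

# Pick evenly spaced entries so neighbouring clusters get distinct colors.
n_clusters = 7  # assumed number of clusters, for illustration only
step = len(sorted_names) // n_clusters
cluster_colors = [sorted_names[i * step] for i in range(n_clusters)]

print(cluster_colors)
```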
The code below generates a 3D scatter plot in which each data point is colored according to its cluster.
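A possible version of such a plot is sketched here with matplotlib. The random points, the random labels, and the three colors are stand-ins for the PCA output and K-means labels, which are assumptions in this snippet.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(60, 3))      # stand-in for the 3D PCA output
labels = rng.integers(0, 3, size=60)   # stand-in for the cluster labels
cluster_colors = ["tab:blue", "tab:orange", "tab:green"]

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")

# Draw one scatter layer per cluster so each gets its own color and legend entry.
for cluster_id, color in enumerate(cluster_colors):
    mask = labels == cluster_id
    ax.scatter(
        points[mask, 0], points[mask, 1], points[mask, 2],
        color=color, label=f"cluster {cluster_id}",
    )

ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
ax.legend()
```

Calling `plt.show()` afterwards opens the figure interactively, and `fig.savefig(...)` writes it to a file instead.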
The scatter plot suggests the hypothesis that clusters formed from the initial data will not have good explanatory power. To address this, we need to reduce the features to a smaller, more useful set, from which we will be able to generate more meaningful clusters. One way of producing such a set of features is to carry out correlation analysis. We can do this by plotting heatmaps and trisurface plots in the following way.
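The heatmap step could look roughly like the sketch below, which uses only pandas and matplotlib. The five synthetic columns are assumptions, with `feature_4` built to correlate strongly with `feature_0` so the plot has something to show.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(50, 5)),
    columns=[f"feature_{i}" for i in range(5)],
)
# Make one column depend on another so a strong correlation appears.
df["feature_4"] = df["feature_0"] * 0.9 + rng.normal(scale=0.1, size=50)

corr = df.corr()  # pairwise Pearson correlation between columns

fig, ax = plt.subplots(figsize=(6, 5))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="correlation")
fig.tight_layout()
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the color scale symmetric, so strongly positive and strongly negative pairs stand out equally.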
The following code generates a trisurface plot of the correlation matrix by first building a list of tuples. Each tuple contains the coordinates and the correlation value, in the order of animal names.
The pseudo-code for the above expressions is as follows:
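The tuple construction described above might be written in Python as in this sketch. The four-column random table is a stand-in, and `triples` is a hypothetical name; each tuple holds the row position, the column position, and the correlation value at that cell.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(40, 4)), columns=["a", "b", "c", "d"])
corr = df.corr()

# One (x, y, z) tuple per matrix cell: x and y are positions, z the correlation.
triples = [
    (i, j, corr.iat[i, j])
    for i in range(corr.shape[0])
    for j in range(corr.shape[1])
]

print(triples[:3])
```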
Code for the generation of trisurface plot for correlation matrix:
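A hedged sketch of the plotting step follows, using matplotlib's `plot_trisurf` on the flattened cell coordinates and correlation values. The six random columns are an assumption standing in for the real correlation matrix.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(40, 6)), columns=list("abcdef"))
corr = df.corr().to_numpy()

# Flatten the matrix into x (row), y (column), z (correlation) arrays.
n = corr.shape[0]
rows, cols = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
x, y, z = rows.ravel(), cols.ravel(), corr.ravel()

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")

# plot_trisurf triangulates the (x, y) points and drapes a surface over z.
surface = ax.plot_trisurf(x, y, z, cmap="coolwarm", linewidth=0.2)
fig.colorbar(surface, ax=ax, shrink=0.6, label="correlation")
ax.set_xlabel("row index")
ax.set_ylabel("column index")
ax.set_zlabel("correlation")
```

Peaks and troughs of the surface mark the strongly correlated and strongly anti-correlated pairs, complementing the flat heatmap view.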
By using heatmaps and trisurface plots, we can infer a smaller set of features to use for cluster analysis. In general, feature pairs with extreme correlation values carry high explanatory power and can be used for further analysis.
In this case, by looking at both plots, we arrive at a reasonable list of 7 features.
Running cluster analysis again on this feature subset produces a scatter plot that shows more clearly how the different animals are spread among the groups.
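The re-run can be sketched as below. The 7 selected features are not named in this text, so the column names `f0` to `f15`, the `selected` list, the helper `fit_inertia`, and `n_clusters=7` are all hypothetical placeholders on random binary data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
columns = [f"f{i}" for i in range(16)]
df = pd.DataFrame(rng.integers(0, 2, size=(100, 16)), columns=columns)

# Hypothetical subset: the actual 7 features are not listed in the text.
selected = ["f0", "f1", "f3", "f4", "f7", "f8", "f11"]


def fit_inertia(frame, n_clusters=7):
    """Fit K-means on the given columns and return the resulting inertia."""
    data = frame.to_numpy(dtype=float)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(data)
    return model.inertia_


inertia_all = fit_inertia(df)
inertia_subset = fit_inertia(df[selected])

print(inertia_all, inertia_subset)
```

Inertia sums squared distances over every feature dimension, so dropping features tends to lower it on its own; the number is most informative when read together with the scatter plot.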
We observe a reduced overall inertia of 14.479670329670329, which is lower than the initial inertia of 119.70392382759556.