Exploratory Data Analysis (EDA):

EDA is a data analysis step, carried out here in Python, for gaining a better understanding of a data set, including:

  • The main features of the data.
  • The variables and the relationships among them.
  • The variables that are important for the problem at hand.

Here we will look at various EDA methods such as:

  • Descriptive statistics, which give a brief overview of the data set at hand, including summary measures of the sample.
  • Grouping of data (groupby() for basic grouping).
  • ANOVA (Analysis of Variance), a statistical method that divides the variation in a set of observations into different components.
  • Correlation and the methods for computing it.

The data set used here is the Chile voting data set, loaded into a pandas dataframe named DF.
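The original import code is not available, so the following is only a minimal sketch of loading such a table with pandas. The column names and every value are assumptions made up for illustration, loosely modelled on how the Chile voting data is commonly laid out, and the file name mentioned in the comment is hypothetical.

```python
import io

import pandas as pd

# Hypothetical rows laid out like the Chile voting data. The values are
# invented for illustration and are not taken from the real data set.
CSV_TEXT = """region,sex,age,education,income,statusquo,vote
N,M,65,P,35000,1.01,Y
N,M,29,PS,7500,-1.30,N
N,F,38,P,15000,1.23,Y
C,F,49,S,35000,-1.03,N
C,M,23,S,35000,-1.10,N
S,F,28,P,7500,-1.05,N
S,M,26,PS,35000,-0.79,N
M,F,41,S,15000,0.50,U
"""

# With a real file this would be pd.read_csv("Chile.csv"), where the
# file name is an assumption about where the data has been saved.
DF = pd.read_csv(io.StringIO(CSV_TEXT))

print(DF.shape)   # number of rows and columns
print(DF.head())  # first few rows
```

Reading from an in-memory string keeps the sketch runnable without any file on disk; swapping in a real path is the only change needed for actual data.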

Descriptive Statistics:

Descriptive statistics are a way to understand the characteristics of the data and to obtain a quick summary of it. In pandas, the describe() method computes basic statistics for the data set, such as the count, mean, standard deviation, extreme values, and quartiles. Missing values (NaN) are skipped. The output of describe() gives a good idea of the distribution of the data.

DF.describe()
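As a small illustration of what describe() returns, the sketch below runs it on a made-up dataframe. The column names and values are assumptions for the example only, with one NaN added to show how missing values are handled.

```python
import numpy as np
import pandas as pd

# Made-up sample; one age is missing on purpose.
DF = pd.DataFrame({
    "age": [65, 29, 38, 49, 23, 28, np.nan, 41],
    "income": [35000, 7500, 15000, 35000, 35000, 7500, 35000, 15000],
    "vote": ["Y", "N", "Y", "N", "N", "N", "N", "U"],
})

# By default only the numeric columns are summarised.
summary = DF.describe()
print(summary)

# The NaN in "age" is skipped, so its count is lower than for "income".
print(summary.loc["count"])

# include="all" adds the non-numeric columns to the summary as well.
print(DF.describe(include="all"))
```

Comparing the count row with the number of rows in the dataframe is a quick way to spot columns that contain missing values.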

Another useful method is value_counts(), which gives the count of each category in a categorical column. Example: assume a data set of customers divided into youth, medium, and old categories under a column named age, in a dataframe called “DF”. To find the number of people in each category, we can run DF['age'].value_counts().
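The customer example can be sketched as follows. The dataframe is invented for illustration, and the column name age is taken from the description above.

```python
import pandas as pd

# Made-up customers, each labelled with an age category.
DF = pd.DataFrame({
    "age": ["youth", "old", "medium", "youth", "medium", "youth"],
})

# Number of customers in each category, most frequent first.
counts = DF["age"].value_counts()
print(counts)

# normalize=True returns proportions instead of raw counts.
shares = DF["age"].value_counts(normalize=True)
print(shares)
```

The result is a pandas Series indexed by category, so a single count can be read with counts["youth"].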

Another useful tool is the boxplot, which is available through the matplotlib module. It is a pictorial representation of the distribution of the data, showing the median, the quartiles, and the extreme values.
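A minimal sketch of drawing a boxplot with matplotlib is shown below. The ages are made up, with one deliberately extreme value, and the figure is rendered off-screen into memory; in an interactive session plt.show() would display it instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages; 95 is far from the rest on purpose.
DF = pd.DataFrame({"age": [23, 26, 28, 29, 38, 41, 49, 65, 95]})

fig, ax = plt.subplots()
parts = ax.boxplot(DF["age"].to_numpy())
ax.set_ylabel("age")

# Write the image into an in-memory buffer instead of a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")

# The box is drawn from these three values.
q1, median, q3 = DF["age"].quantile([0.25, 0.5, 0.75])
print(q1, median, q3)

# Points beyond 1.5 times the interquartile range are drawn separately.
outliers = parts["fliers"][0].get_ydata()
print(outliers)
```

With matplotlib's default settings, the whiskers extend to the most extreme points within 1.5 times the interquartile range of the box, and anything further out is plotted as an individual point.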

Grouping of Data:

groupby() is a pandas method that shows the effect of different categorical attributes on the other variables. Example: we will use the data set above to look at how people’s education and age relate to their vote.
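The original grouping code is not available, so the sketch below shows one plausible reading of the example: the mean age for each combination of education and vote. The column names and values are assumptions made up for illustration.

```python
import pandas as pd

# Made-up sample with an education level, a vote, and an age per person.
DF = pd.DataFrame({
    "education": ["P", "P", "S", "S", "PS", "PS", "P", "S"],
    "vote": ["Y", "Y", "N", "N", "N", "N", "N", "Y"],
    "age": [65, 38, 49, 23, 29, 26, 28, 41],
})

# Mean age within each (education, vote) group.
mean_age = DF.groupby(["education", "vote"])["age"].mean()
print(mean_age)

# unstack() turns the vote level into columns for easier comparison.
# Combinations that never occur show up as NaN.
table = mean_age.unstack()
print(table)
```

Other aggregations such as count() or median() can be used in place of mean() in the same way.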

ANOVA:

ANOVA stands for Analysis of Variance. It tests whether the mean of a numeric variable differs between the groups defined by a categorical variable.

ANOVA returns two measures:

  • F-test score: the ratio of the variation between the group means to the variation within the groups.
  • p-value: the statistical significance of the result.

ANOVA is performed with the f_oneway() function from the scipy.stats module. The syntax is:

scipy.stats.f_oneway(sample1, sample2, ...)

Each sample holds the measurements of one group.

If the F-test score is large and the p-value is small, we can conclude that the categorical variable is strongly associated with the other variable, because the group means differ by more than chance alone would explain.
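A minimal sketch of a one-way ANOVA with f_oneway() follows. The data is invented so that the three education groups have clearly different mean ages; the column names are assumptions for the example.

```python
import pandas as pd
from scipy import stats

# Made-up ages for three education groups with clearly different means.
DF = pd.DataFrame({
    "education": ["P"] * 4 + ["S"] * 4 + ["PS"] * 4,
    "age": [60, 55, 65, 58, 40, 42, 38, 45, 25, 28, 30, 27],
})

# One array of ages per education group.
samples = [group["age"].to_numpy() for _, group in DF.groupby("education")]

# f_oneway takes each group as a separate positional argument.
f_score, p_value = stats.f_oneway(*samples)
print("F-test score:", f_score)
print("p-value:", p_value)
```

Here the variation between the group means is much larger than the variation inside each group, so the F-test score is large and the p-value is small.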

Correlation and Correlation computation:

Correlation measures how strongly two variables change together. It is different from causation: a correlation between two variables does not show that one causes the other. Pearson correlation is one way of calculating the correlation between variables. It returns two values, the Pearson coefficient and the p-value. Two variables have a strong correlation when the Pearson coefficient is close to either 1 or -1 and the p-value is less than 0.0001.

The scipy.stats module provides the pearsonr() function for Pearson correlation analysis. The syntax is:

scipy.stats.pearsonr(sample1, sample2)

Here sample1 and sample2 are the attributes we want to compare.
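A minimal sketch of a Pearson correlation with pearsonr() is shown below. Both samples are invented and nearly linear, so the coefficient comes out close to 1; the variable names are placeholders.

```python
from scipy import stats

# Made-up samples with an almost perfectly linear relationship.
sample1 = [1, 2, 3, 4, 5, 6, 7, 8]
sample2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# pearsonr returns the Pearson coefficient and the p-value.
coefficient, p_value = stats.pearsonr(sample1, sample2)
print("Pearson coefficient:", coefficient)
print("p-value:", p_value)

# Reversing the direction of one sample flips the sign of the coefficient.
negated = [-value for value in sample2]
neg_coefficient, _ = stats.pearsonr(sample1, negated)
print("Pearson coefficient (negated):", neg_coefficient)
```

A coefficient near 0 with a large p-value would instead suggest that no linear relationship is visible in the data.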

