Pandas DataFrame in Python:

In Python, a Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In a data frame, data is aligned in tabular fashion into rows and columns. A Pandas data frame has three principal components: the data, the rows, and the columns.

Pandas DataFrame in Python for Data Science - PST Analytics

Some of the basic operations performed on a Pandas DataFrame are as follows.

Creation of Pandas DataFrame:

In practice, a Pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, dictionaries, a list of dictionaries, and so on. Some of these ways of creating a dataframe are discussed below.

Creation of dataframe using List: We can create a dataframe from a single list or from a list of lists.
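As a minimal sketch of both cases, the snippet below builds one dataframe from a flat list and another from a list of lists. The names, ages, and column labels are hypothetical sample data chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: a flat list becomes a single column.
names = ["Alice", "Bob", "Carol"]
df_from_list = pd.DataFrame(names, columns=["Name"])

# A list of lists: each inner list becomes one row.
rows = [["Alice", 25], ["Bob", 30], ["Carol", 35]]
df_from_rows = pd.DataFrame(rows, columns=["Name", "Age"])

print(df_from_list)
print(df_from_rows)
```

If the columns argument is omitted, pandas labels the columns with integers starting at 0.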


Creation of DataFrame from dict of ndarray/lists: To create a DataFrame from a dict of ndarrays or lists, all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.
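The rules above can be sketched as follows. The dictionary contents and the row labels "r1" to "r3" are hypothetical, picked only to show the default index, a custom index, and the error raised when the lengths differ.

```python
import pandas as pd

# Hypothetical sample data: every list has the same length (3).
data = {"Name": ["Alice", "Bob", "Carol"], "Age": [25, 30, 35]}

# No index passed: pandas falls back to range(n), here 0, 1, 2.
df_default = pd.DataFrame(data)

# Index passed: its length must match the length of the lists.
df_labeled = pd.DataFrame(data, index=["r1", "r2", "r3"])

# Lists of different lengths are rejected with a ValueError.
try:
    pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30, 35]})
    length_error = None
except ValueError as exc:
    length_error = exc

print(df_default)
print(df_labeled)
print("Mismatched lengths raised:", type(length_error).__name__)
```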


Dealing with Rows and Columns:

Data in a data frame is aligned in a tabular manner into rows and columns. Basic operations such as selecting, deleting, adding, and renaming can be performed on rows and columns.

Column selection: To select a column in a Pandas dataframe, we access it by its name.
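A short sketch of selection by name, using a hypothetical dataframe with Name, Age, and City columns. Passing a list of names inside the brackets selects several columns at once.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [25, 30, 35],
    "City": ["Pune", "Delhi", "Mumbai"],
})

# A list of column names selects those columns, in the order given.
selected = df[["Name", "City"]]
print(selected)
```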


Row selection: DataFrame.loc[] retrieves rows from a Pandas DataFrame by label. We can also select rows by integer position using iloc[].
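To make the difference between label and position concrete, the sketch below uses a hypothetical dataframe whose index holds names rather than the default integers. Both lookups return the same row.

```python
import pandas as pd

# Hypothetical sample data with string row labels.
df = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["Pune", "Delhi", "Mumbai"]},
    index=["Alice", "Bob", "Carol"],
)

row_by_label = df.loc["Bob"]   # look the row up by its label
row_by_position = df.iloc[1]   # look the row up by its integer position

print(row_by_label)
print(row_by_position)
```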


Indexing and Selecting Data:

Indexing is also known as subset selection. In pandas, indexing means selecting particular rows and columns of data from a data frame. We can select all of the rows and some of the columns, some of the rows and all of the columns, or some of each.

Indexing a DataFrame by indexing operator[]:

The indexing operator refers to the square brackets that follow an object. The .loc and .iloc indexers also use square brackets for selection.

Selecting a single column:

To select a single column, simply put the name of the column between the brackets.
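As an illustration with hypothetical data, a single column name returns a pandas Series, while a name that does not exist raises a KeyError.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Carol"], "Age": [25, 30, 35]})

# One column name inside the brackets returns a Series.
ages = df["Age"]
print(type(ages).__name__)
print(ages)

# Asking for a column that is not present raises KeyError.
try:
    df["Salary"]
    missing_column_error = None
except KeyError as exc:
    missing_column_error = exc
```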


Indexing dataframe using .loc[ ]:

It is an indexer that selects data by the labels of rows and columns. The df.loc indexer selects data differently from the plain indexing operator: it can select subsets of rows and columns simultaneously.

Selecting a single row:

To select a single row with .loc[], we pass a single row label to .loc[].
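A minimal sketch with a hypothetical name-indexed dataframe. The first lookup selects one row by label; the other two show row and column labels being combined in a single .loc[] call.

```python
import pandas as pd

# Hypothetical sample data with string row labels.
df = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["Pune", "Delhi", "Mumbai"]},
    index=["Alice", "Bob", "Carol"],
)

single_row = df.loc["Carol"]                   # one row, returned as a Series
cell = df.loc["Carol", "Age"]                  # one row label and one column label
subset = df.loc[["Alice", "Carol"], ["City"]]  # lists of labels give a DataFrame

print(single_row)
print(cell)
print(subset)
```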


Indexing a dataframe using .iloc[ ]:

It is an indexer that retrieves rows and columns by position. The df.iloc indexer is very similar to df.loc, except that it uses integer locations to make selections.

Selecting a single row:

To select a single row with .iloc[], we pass a single integer to .iloc[].
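The sketch below, again on hypothetical data, picks rows purely by position. Negative integers count from the end, as with Python lists.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Carol"], "Age": [25, 30, 35]})

first_row = df.iloc[0]    # row at position 0
last_row = df.iloc[-1]    # last row, counted from the end
cell = df.iloc[1, 0]      # row position 1, column position 0

print(first_row)
print(last_row)
print(cell)
```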


Working with missing data:

Missing data occurs when no information is provided for one or more items. In pandas, missing data is also referred to as NA (not available) values.

Checking missing values using isnull() and notnull():

Both the isnull() and notnull() functions help check whether a value is NaN or not. They can also be used on a Series to find null values.
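As a sketch, the hypothetical score table below contains two NaN values. isnull() and notnull() each return a Boolean dataframe of the same shape, and summing the isnull() mask counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing entries.
scores = pd.DataFrame({
    "First": [100, 90, np.nan],
    "Second": [30, np.nan, 45],
})

missing_mask = scores.isnull()    # True where a value is NaN
present_mask = scores.notnull()   # True where a value is present

# True counts as 1, so the column sums give the number of NaNs.
missing_per_column = scores.isnull().sum()

print(missing_mask)
print(present_mask)
print(missing_per_column)
```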


Filling of missing values using fillna(), replace() and interpolate():

The functions fillna(), replace(), and interpolate() replace NaN values with other values. The interpolate() function fills NA values using interpolation techniques rather than a hard-coded value.
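The three approaches can be compared on one small, hypothetical table. The fill values 0 and -1 are arbitrary choices for illustration, and interpolate() is called with its default linear method.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one gap in each column.
scores = pd.DataFrame({
    "First": [100.0, np.nan, 80.0, 70.0],
    "Second": [30.0, 45.0, np.nan, 65.0],
})

filled = scores.fillna(0)             # every NaN becomes the constant 0
replaced = scores.replace(np.nan, -1) # every NaN becomes -1
interpolated = scores.interpolate()   # linear interpolation between neighbours

print(filled)
print(replaced)
print(interpolated)
```

With linear interpolation, each gap here is filled with the midpoint of the values directly above and below it.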


Dropping missing values using dropna():

This function drops the rows or columns of a dataset that contain null values, and it can be configured in several ways.

Now let us drop rows with at least one NaN value.
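A minimal sketch on hypothetical data: by default dropna() removes every row that has at least one NaN, and axis=1 applies the same rule to columns instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: rows 2 and 3 each contain one NaN.
scores = pd.DataFrame({
    "First": [100, 90, np.nan, 95],
    "Second": [30, 45, 56, np.nan],
    "Third": [52, 40, 80, 98],
})

rows_dropped = scores.dropna()           # drop rows with at least one NaN
columns_dropped = scores.dropna(axis=1)  # drop columns with at least one NaN

print(rows_dropped)
print(columns_dropped)
```

dropna() returns a new dataframe and leaves the original unchanged.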


Iterating over rows and columns:

A data frame consists of rows and columns and is iterated much like a dictionary: iterating over it directly yields the column labels.

Iterating over rows:

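To iterate over rows, two functions are available: iterrows() and itertuples(). A third function, iteritems() (named items() in recent pandas versions), iterates over columns rather than rows.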

Now let us apply the iterrows() function to get each element of the rows.
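As a sketch with hypothetical data, iterrows() yields one (index, row) pair per row, where the row is a Series keyed by column name. The collected list exists only so the values can be inspected afterwards.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

collected = []
for index, row in df.iterrows():
    # row is a Series, so individual values are read by column name.
    print(index, row["Name"], row["Age"])
    collected.append((index, row["Name"], row["Age"]))
```

For larger dataframes, itertuples() is generally faster than iterrows(), because it does not build a Series for each row.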


Iterating over columns:

To iterate over columns, we first create a list of the dataframe's columns and then iterate through that list to pull out each column.
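The approach can be sketched as below, on hypothetical data. Calling list() on a dataframe returns its column labels, which are then used to look up each column in turn.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# list(df) gives the column labels in order.
columns = list(df)

column_values = {}
for column in columns:
    # Look up each column by its label and collect its values.
    column_values[column] = df[column].tolist()
    print(column, column_values[column])
```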


DataFrame Methods:


