Merging, Joining and Concatenation in Pandas:

Concatenation of dataframe in Python for Data Science

To concatenate DataFrames, we use the concat() function. The different ways of concatenating DataFrames are given below.

  • Concatenating a dataframe using the concat()
  • Concatenate dataframe by setting up logic on axes.
  • Concatenating dataframe using .append()
  • Concatenation of dataframe by ignoring the indexes.
  • Concatenate dataframe with group keys.
  • Concatenating with mixed ndims.

Concatenation of DataFrame by concat() function:

The concat() function takes a list of DataFrames, stacks them along an axis (the rows by default), and returns a new DataFrame.
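
The original code listing is not available here, so the following is only a minimal sketch of a basic concat() call. The frames df1 and df2 and their column names are made up for illustration.

```python
import pandas as pd

# Two small, hypothetical frames with the same columns.
df1 = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]}, index=[0, 1])
df2 = pd.DataFrame({"id": [3, 4], "name": ["Cid", "Dee"]}, index=[2, 3])

# concat() takes a list of frames and stacks them row-wise (axis=0) by default.
result = pd.concat([df1, df2])
print(result)
```

The inputs are left unchanged; result is a new DataFrame holding all four rows.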

Concatenation of DataFrame by setting logic on the axes:

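When the DataFrames being concatenated do not share the same labels, we need to decide how the other axis (the one we are not concatenating along) is handled. This can be done in the following three ways:

  • Taking the union of all labels, join='outer'. This is the default option, as it results in zero information loss.
  • Taking the intersection, join='inner'.
  • Using a specific index. Older pandas versions accepted it through the join_axes argument, which was removed in pandas 1.0; calling .reindex() on the result is the replacement.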
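
As a rough illustration of the intersection option, the sketch below concatenates two hypothetical frames side by side with join='inner'. The frames and their labels are assumptions, not the original example.

```python
import pandas as pd

# Hypothetical frames whose index labels only partly overlap.
df1 = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=[0, 1, 2])
df2 = pd.DataFrame({"B": ["B1", "B2", "B3"]}, index=[1, 2, 3])

# Concatenate side by side (axis=1) and keep only the index labels
# that are present in both frames.
inner = pd.concat([df1, df2], axis=1, join="inner")
print(inner)
```

Only the labels 1 and 2 survive, because they are the only ones found in both inputs.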

We will now set join='outer' to take the union of the DataFrames.
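
A minimal sketch of the union case, again with made-up frames: every index label from either input is kept, and missing cells are filled with NaN.

```python
import pandas as pd

# Hypothetical frames whose index labels only partly overlap.
df1 = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=[0, 1, 2])
df2 = pd.DataFrame({"B": ["B1", "B2", "B3"]}, index=[1, 2, 3])

# join="outer" is the default; it is spelled out here for clarity.
outer = pd.concat([df1, df2], axis=1, join="outer")
print(outer)
```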

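Now we will use a specified index. In older pandas versions it was passed to the join_axes argument; since pandas 1.0 the same result is obtained by calling .reindex() on the concatenated DataFrame.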
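
Since join_axes no longer exists in current pandas, the sketch below uses .reindex() on the concatenated result instead, which is the documented replacement. The frames are hypothetical.

```python
import pandas as pd

# Hypothetical frames whose index labels only partly overlap.
df1 = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=[0, 1, 2])
df2 = pd.DataFrame({"B": ["B1", "B2", "B3"]}, index=[1, 2, 3])

# Concatenate on the union of labels, then keep exactly df1's index.
# Older code wrote this as pd.concat([df1, df2], axis=1, join_axes=[df1.index]).
result = pd.concat([df1, df2], axis=1).reindex(df1.index)
print(result)
```

The result has exactly the rows of df1; label 0 has no partner in df2, so its B value is NaN.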

Concatenation of DataFrame using .append():

We can also concatenate DataFrames with the .append() method, which concatenates along axis=0, namely the index. This method predates concat(). Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so newer code should use concat() instead.

We will now apply the .append() method to concatenate DataFrames.
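
Because .append() is no longer available in pandas 2.0 and later, the sketch below shows the older call only as a comment and performs the equivalent operation with concat(). The frames are made up for illustration.

```python
import pandas as pd

# Two hypothetical frames with the same column.
df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3, 4]})

# On pandas versions before 2.0 this was written as:
#     result = df1.append(df2)
# The equivalent call that works on all versions is:
result = pd.concat([df1, df2])
print(result)
```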

Concatenation of DataFrame by ignoring indexes:

When the index carries no meaning, we can ignore it while concatenating. The DataFrames are appended regardless of any overlapping index values, and the result gets a fresh index running from 0 to n-1. To do this, we pass ignore_index=True as an argument.

We will now apply ignore_index as an argument.
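
The following sketch, using two hypothetical frames that both carry the default index, contrasts the result with and without ignore_index=True.

```python
import pandas as pd

# Both frames carry the default index 0, 1, so their labels overlap.
df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3, 4]})

# Without ignore_index the original (duplicated) labels are kept.
kept = pd.concat([df1, df2])

# With ignore_index=True the result is renumbered from 0.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(kept)
print(renumbered)
```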

Concatenation of DataFrame with group keys:

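To concatenate DataFrames with group keys, we use the keys argument. Each key labels the rows that came from the corresponding input and forms the outermost level of a hierarchical index. When Series are concatenated along axis=1, the keys override the column names.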
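
Both uses of keys can be sketched as follows. All object names and key labels here are assumptions chosen for illustration.

```python
import pandas as pd

# Case 1: Series concatenated along axis=1; keys replace the Series names
# as column labels.
s1 = pd.Series([1, 2], name="a")
s2 = pd.Series([3, 4], name="b")
wide = pd.concat([s1, s2], axis=1, keys=["first", "second"])
print(wide)

# Case 2: DataFrames concatenated along axis=0; keys become the outer
# level of a MultiIndex, so each input can be selected again by its key.
df1 = pd.DataFrame({"v": [1, 2]})
df2 = pd.DataFrame({"v": [3, 4]})
stacked = pd.concat([df1, df2], keys=["x", "y"])
print(stacked)
print(stacked.loc["y"])
```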

Concatenation with mixed ndims:

We can also concatenate a mix of Series and DataFrame objects. The Series is converted to a DataFrame whose column name is the name of the Series.
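
A small sketch of this behaviour, assuming a hypothetical frame df and a named Series s that are concatenated side by side:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
s = pd.Series(["x", "y", "z"], name="label")

# The Series is treated as a one-column frame named after the Series.
mixed = pd.concat([df, s], axis=1)
print(mixed)
```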

Merging of DataFrame:

Pandas offers high-performance, in-memory merging and joining. When very large DataFrames need to be combined, joins are a powerful way to perform these operations quickly. A merge is performed on two DataFrames at a time, referred to as the left and right tables. The key is the common column on which the two DataFrames are joined. Keys should have unique values throughout the column to avoid unintended duplication of row values. Pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame objects.

There are four basic ways to handle the join (inner, left, right, and outer), depending on which rows must retain their data.

Code 1: Merging a dataframe using a unique key combination.
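
The original listing is missing, so this is only a minimal sketch with made-up frames left and right that share a column named key holding unique values.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K0", "K1", "K2"], "B": ["B0", "B1", "B2"]})

# Rows are matched on the shared "key" column.
merged = pd.merge(left, right, on="key")
print(merged)
```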

Code 2: Merging a dataframe using multiple join keys.
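
As a sketch of merging on more than one key, the hypothetical frames below share the columns key1 and key2. A row is kept only when both key values match, because merge() performs an inner join by default.

```python
import pandas as pd

left = pd.DataFrame({
    "key1": ["K0", "K0", "K1"],
    "key2": ["K0", "K1", "K0"],
    "A": [1, 2, 3],
})
right = pd.DataFrame({
    "key1": ["K0", "K1", "K1"],
    "key2": ["K0", "K0", "K1"],
    "B": [10, 20, 30],
})

# Pass a list of column names to match on the combination of keys.
merged = pd.merge(left, right, on=["key1", "key2"])
print(merged)
```

Only the combinations (K0, K0) and (K1, K0) appear in both frames, so the result has two rows.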

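Merging a dataframe using the how argument:

The how argument of merge() specifies which keys are included in the resulting table. If a key combination does not appear in either the left or the right table, the corresponding values in the joined table will be NA. Below is a summary of the how options and their SQL equivalent names:

  • left: LEFT OUTER JOIN, uses keys from the left frame only.
  • right: RIGHT OUTER JOIN, uses keys from the right frame only.
  • outer: FULL OUTER JOIN, uses the union of keys from both frames.
  • inner: INNER JOIN, uses the intersection of keys from both frames (the default).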
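
The effect of each option can be sketched by merging the same pair of hypothetical frames four times and comparing the number of rows returned.

```python
import pandas as pd

left = pd.DataFrame({
    "key1": ["K0", "K0", "K1"],
    "key2": ["K0", "K1", "K0"],
    "A": [1, 2, 3],
})
right = pd.DataFrame({
    "key1": ["K0", "K1", "K1"],
    "key2": ["K0", "K0", "K1"],
    "B": [10, 20, 30],
})

# Merge once per join type and record how many rows each one returns.
row_counts = {}
for how in ("inner", "left", "right", "outer"):
    merged = pd.merge(left, right, on=["key1", "key2"], how=how)
    row_counts[how] = len(merged)
    print(how)
    print(merged)
```

Two key combinations are shared, one exists only on the left and one only on the right, so the inner join returns 2 rows, the left and right joins 3 rows each, and the outer join 4 rows.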

Joining DataFrame:

join() is a DataFrame method used for combining the columns of two potentially differently indexed DataFrames into a single result DataFrame.
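
A minimal sketch of an index-based join, assuming two made-up frames whose index labels only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"B": ["B0", "B2", "B3"]}, index=["K0", "K2", "K3"])

# join() aligns on the index and performs a left join by default.
joined = left.join(right)
print(joined)

# The how argument works as in merge(); "outer" keeps labels from both frames.
joined_outer = left.join(right, how="outer")
print(joined_outer)
```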

Joining dataframe using the on argument:

The join() method takes an optional on argument, which may be a column name or a list of column names. It specifies that the passed DataFrame is to be aligned on that column (or columns) of the calling DataFrame, matching those values against the index of the passed DataFrame.
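
In the sketch below, the hypothetical frame left has an ordinary column named key, while right is indexed by the same key values.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K0"], "A": [1, 2, 3]})
right = pd.DataFrame({"B": [10, 20]}, index=["K0", "K1"])

# Values in left["key"] are looked up in the index of right.
joined = left.join(right, on="key")
print(joined)
```

The rows and index of left are preserved, and each row picks up the B value that belongs to its key.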

Joining singly indexed DataFrame with multi indexed DataFrame:

When joining a singly indexed DataFrame with a multi-indexed DataFrame, the name of the index of the singly indexed frame is matched against a level name of the multi-indexed frame.
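
This can be sketched as follows. The index name key, the second level name Y, and the data are all assumptions made for illustration.

```python
import pandas as pd

# Singly indexed frame; the index is named "key".
left = pd.DataFrame(
    {"A": [1, 2]},
    index=pd.Index(["K0", "K1"], name="key"),
)

# Multi-indexed frame; one of its levels is also named "key".
multi = pd.MultiIndex.from_tuples(
    [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2")],
    names=["key", "Y"],
)
right = pd.DataFrame({"B": [10, 20, 30]}, index=multi)

# The index name "key" is matched against the level of the same name.
joined = left.join(right, how="inner")
print(joined)
```

The K2 row has no counterpart in left, so the inner join drops it.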
