Working With Text Data in Pandas:

Series and Index objects are equipped with a set of string processing methods that make it easy to operate on each element of the array. These methods skip missing (NA) values automatically, so a missing entry stays missing in the result. They are accessed through the str attribute and generally have names that match the equivalent built-in Python string methods.
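As a minimal sketch of the idea, the snippet below calls two string methods through the str attribute on a Series that contains a missing value. The data and the variable names (names, capitalized, lengths) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; None stands in for a missing value.
names = pd.Series(["alice", "Bob", None, "carol"])

# Methods are called through the .str accessor and applied element-wise.
capitalized = names.str.capitalize()
lengths = names.str.len()

# The missing entry is skipped: it stays missing instead of raising an error.
print(capitalized.tolist())
print(lengths.tolist())
```

The same accessor is available on an Index, for example pd.Index(["a", "b"]).str.upper().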

Lowercasing and uppercasing data:

We use the str.lower() function to convert all uppercase characters to lowercase. If no uppercase character is present, it returns the original string unchanged. Similarly, str.upper() converts all lowercase characters to uppercase; if no lowercase character exists, it returns the original string unchanged.
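The two functions can be tried on a small DataFrame. This is only a sketch: the Team values and the new column names (Team_upper, Team_lower) are assumptions chosen for illustration, not the dataset used in the original screenshots.

```python
import pandas as pd

# Hypothetical data with mixed casing in the Team column.
df = pd.DataFrame({"Team": ["Red Hawks", "blue jays", "GREEN OWLS"]})

# Convert every value in the column to uppercase / lowercase.
df["Team_upper"] = df["Team"].str.upper()
df["Team_lower"] = df["Team"].str.lower()

print(df)
```

A value that is already fully uppercase, such as "GREEN OWLS", comes back unchanged from str.upper().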

Code 1:

OUTPUT:

Code 2:

OUTPUT:

We can see from the output that all values in the Team column have been converted to uppercase.

Splitting and replacing data:

For splitting data, we use the str.split() function, which breaks each string at the specified separator and returns a list of strings. Python's built-in split() applies only to an individual string, whereas the pandas str.split() method applies to a whole Series. We need to prefix .str before every call to distinguish it from Python's built-in function; calling split() directly on a Series raises an AttributeError. For replacing data, we use str.replace(), which works like Python's built-in .replace() string method but operates on every element of a Series.

Code 1:

OUTPUT:

From the output, we can see that each value in the address column was split only at the first occurrence of “a” and not at later occurrences, since the n parameter was set to 1.
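The effect of the n parameter can be reproduced with a small sketch. The address values below are invented for illustration, and expand=True is an optional extra we add here to show the pieces as separate columns.

```python
import pandas as pd

# Hypothetical addresses; the last one contains no lowercase "a".
df = pd.DataFrame({"address": ["Banana Avenue", "Park Lane", "High Street"]})

# Split on "a" at most once (n=1): each value yields at most two pieces.
parts = df["address"].str.split("a", n=1)
print(parts.tolist())

# expand=True spreads the pieces over columns instead of returning lists.
expanded = df["address"].str.split("a", n=1, expand=True)
print(expanded)
```

A value without the separator, such as "High Street", is returned as a single piece, and its second column is left missing in the expanded result.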

Code 2:

OUTPUT:

As we can see from the output, the values in the Age column that were equal to 25 have been replaced by “Twenty five”.
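A replacement like this one can be sketched as follows. The ages are made-up values, and we assume the column is numeric, so it is converted to strings first because the str attribute only works on text.

```python
import pandas as pd

# Hypothetical numeric ages.
ages = pd.Series([25, 31, 25, 40])

# Convert to strings, then replace the literal text "25".
# regex=False treats the pattern as plain text rather than a regular expression.
ages_text = ages.astype(str).str.replace("25", "Twenty five", regex=False)

print(ages_text.tolist())  # ['Twenty five', '31', 'Twenty five', '40']
```

Note that str.replace() matches substrings, so a value such as "125" would also be changed; an exact-match replacement would need a different approach.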

Concatenation of data:

For concatenating a Series or Index, we use the str.cat() function. We can pass values from another Series or list-like object; a plain list or array must have the same length as the calling Series. Like the other string methods, it is called through the .str prefix.
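Here is a minimal sketch of str.cat() joining two columns element by element. The Name and Team values are invented, we use ", " as the separator, and the na_rep argument is an optional extra shown only to illustrate how missing values can be filled.

```python
import pandas as pd

# Hypothetical data; one Team value is missing.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Team": ["Red Hawks", None, "Blue Jays"],
})

# Join Name and Team row by row with a separator.
combined = df["Name"].str.cat(df["Team"], sep=", ")

# By default a missing value on either side makes the result missing;
# na_rep substitutes a placeholder text instead.
combined_filled = df["Name"].str.cat(df["Team"], sep=", ", na_rep="Unknown")

print(combined.tolist())
print(combined_filled.tolist())
```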

Code 1:

OUTPUT:

We can see from the output that every string in the address column has been concatenated with the string at the same index in the Name column, using “,” as the separator.

Code 2:

OUTPUT:

From the output, we see that the Name and Team columns have been concatenated using the separator “,”.

Removal of whitespace from data:

For removing whitespace, we use the functions str.rstrip(), str.lstrip() and str.strip(). The function str.rstrip() removes whitespace from the right side, str.lstrip() removes whitespace from the left side, and str.strip() removes whitespace from both sides.
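The difference between the three functions is easiest to see side by side. This is a small sketch on invented team names with extra spaces added on purpose.

```python
import pandas as pd

# Hypothetical values with leading and/or trailing spaces.
teams = pd.Series(["  Red Hawks  ", " Blue Jays", "Green Owls "])

right_stripped = teams.str.rstrip()  # trailing whitespace removed
left_stripped = teams.str.lstrip()   # leading whitespace removed
both_stripped = teams.str.strip()    # whitespace removed on both sides

print(right_stripped.tolist())  # ['  Red Hawks', ' Blue Jays', 'Green Owls']
print(left_stripped.tolist())   # ['Red Hawks  ', 'Blue Jays', 'Green Owls ']
print(both_stripped.tolist())   # ['Red Hawks', 'Blue Jays', 'Green Owls']
```

Whitespace inside a value, such as the space between the two words, is left untouched by all three functions.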

Code 1:

OUTPUT:

From the output, we see that the comparison returns False for all three conditions, which means the spaces were removed from both sides and the string no longer has leading or trailing spaces.

Code 2:

OUTPUT:

We can see from the output that the comparison becomes True after removing the left-side spaces.

Extracting data:

For extracting data, we use the str.extract() function, which accepts a regular expression with at least one capture group. When the regular expression has more than one group, it returns a DataFrame with one column per group. Elements that do not match return a row filled with NaN.
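A short sketch of extraction with two capture groups is shown below. The input strings and the pattern are assumptions chosen for illustration: the first group matches the letter "a" or "b", the second matches one digit.

```python
import pandas as pd

# Hypothetical input; "c3" does not match because "c" is not in [ab].
codes = pd.Series(["a1", "b2", "c3"])

# Two capture groups give a DataFrame with two columns (named 0 and 1).
extracted = codes.str.extract(r"([ab])(\d)")

print(extracted)
```

The row for "c3" contains only missing values because the pattern as a whole did not match that element.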

Code 1:

OUTPUT:

As we can see in the output, a regular expression with two groups returns a DataFrame with two columns. Elements that do not match return NaN.

Code 2:

OUTPUT:

As we can see from the output, named groups become the column names in the returned result.
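Named groups can be tried with the following sketch. The product codes and the group names (prefix, number) are hypothetical and only serve to show where the column names come from.

```python
import pandas as pd

# Hypothetical product codes; the last one does not fit the pattern.
products = pd.Series(["AB-123", "CD-456", "no code"])

# (?P<name>...) defines a named group; each name becomes a column label.
named = products.str.extract(r"(?P<prefix>[A-Z]+)-(?P<number>\d+)")

print(named)
print(list(named.columns))  # ['prefix', 'number']
```

The extracted values are still strings, so a column such as number would need an explicit conversion before numeric use.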

Pandas str methods:

To learn more about casing in Python for data science, you can check this and this as well.
