Removal of stop words with NLTK:

Pre-processing is the process of converting raw data into a form a computer can work with. One major form of pre-processing is filtering out useless data. In NLP with Python for data science, these useless words are known as stop words.

Stop words:

These are commonly used words such as "the", "an", "a", etc., which a search engine is programmed to ignore, both while indexing entries and while retrieving the results of a search query.

These stop words should be ignored, since they take up space in our database and consume extra processing time. We can remove them by keeping a list of the words we consider to be stop words. NLTK ships with stop-word lists for sixteen different languages, stored in the nltk_data directory. The directory address is:

home/pratima/nltk_data/corpora/stopwords

Replace the home directory name (pratima in the path above) with your own.
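If the stopwords corpus is not yet present in nltk_data, it can be fetched once with NLTK's downloader; the punkt tokenizer data, used by the examples further below, can be fetched the same way:

import nltk

# Download the stop-word lists and the punkt tokenizer models into nltk_data
nltk.download('stopwords')
nltk.download('punkt')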


To check the list of stop words, we can type the following commands in the Python shell.
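A minimal version, assuming the stopwords corpus has already been downloaded:

from nltk.corpus import stopwords

# Print the list of English stop words shipped with NLTK
print(stopwords.words('english'))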

OUTPUT:

A list of all the English stop words is printed, beginning: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', ...]

Removal of stop words using NLTK:

The program below removes stop words from a piece of text:
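The original listing is not reproduced here; a minimal sketch of such a program, using NLTK's word_tokenize and the English stop-word list on a made-up example sentence, could look like this:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

# Build the set of English stop words and tokenize the input text
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not stop words
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

print(word_tokens)
print(filtered_sentence)

Running this prints the full token list first, then the same tokens with words like "is", "a" and "the" removed.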

Performing the stop word operation on a file:

Suppose we need to remove stop words from the file text.txt; we can use code like the following for this purpose.
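The original listing is likewise not reproduced; assuming the input lives in text.txt in the current directory and the filtered words are written to a new file (filteredtext.txt is just a hypothetical name), one way to write it is:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

# Read the raw text from the input file
with open('text.txt', 'r') as infile:
    text = infile.read()

# Tokenize the text and drop the stop words
word_tokens = word_tokenize(text)
filtered_words = [w for w in word_tokens if w.lower() not in stop_words]

# Write the remaining words to a new file (hypothetical output name)
with open('filteredtext.txt', 'w') as outfile:
    outfile.write(' '.join(filtered_words))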

