Pre-processing is the step of converting raw data into a form a computer can work with. One major form of pre-processing is filtering out useless data. In natural language processing (NLP) with Python, such low-information words are known as stop words.
These are commonly used words such as "the", "an", and "a", which search engines are typically programmed to ignore, both when indexing entries and when retrieving results for a search query.
These stop words should be ignored, as they take up space in our database and also consume extra processing time. We can remove them by keeping a list of the words we consider stop words. NLTK ships with stopword lists for sixteen different languages, stored in the nltk_data directory. By default the path is ~/nltk_data/corpora/stopwords, where ~ stands for your own home directory.
To check the list of stop words, we can type the following commands in the Python shell.
Removing stop words with NLTK:
The program below removes stop words from a piece of text:
Removing stop words from a file:
Suppose we need to remove stop words from the file text.txt. We can use the code below for this purpose.