Tokenizing text using NLTK in Python:

NLTK (the Natural Language Toolkit) is a large Python toolkit that aims to support the entire natural language processing (NLP) workflow. It has many applications in Python for data science.

To install NLTK, we run the following steps in our terminal:

  • sudo pip install nltk
  • Start the Python shell in the terminal by typing python
  • Now run import nltk
  • nltk.download('all')

The download step takes time because of the number of tokenizers, chunkers, other algorithms, and corpora it has to fetch.
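If only tokenizing is needed, a smaller download may be enough. The sketch below is one possible way to do that, not the only one: it assumes the tokenizer data is published under the package ids `punkt` (older NLTK releases) and `punkt_tab` (newer releases), and the helper name `download_tokenizer_data` is made up for illustration.

```python
import nltk


def download_tokenizer_data(resources=("punkt", "punkt_tab")):
    """Fetch only the tokenizer data instead of everything.

    Which id is actually required depends on the installed NLTK version
    (an assumption of this sketch), so both are requested.
    Returns a dict mapping each id to True (fetched or already up to date)
    or False (the download failed, for example when offline).
    """
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling `download_tokenizer_data()` once from the Python shell is enough; the data is stored on disk and reused by later sessions.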

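Some commonly used terms are as follows:
  • Corpus: A body of text. Corpus is the singular; the plural is corpora.
  • Lexicon: The words and their meanings.
  • Token: Each piece that results from splitting text according to a set of rules. A sentence is a token when the text is split into sentences, and a word is a token when a sentence is split into words.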
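The three terms can be illustrated with a toy example in plain Python. Everything here is an assumption made for illustration: the sample text, the splitting rule (a run of word characters or a single punctuation mark), and the glosses in the lexicon. It does not show how NLTK stores these things internally.

```python
import re

# Corpus: a body of text (one tiny corpus here; several would be corpora).
corpus = "The cat sat. The cat slept."

# Token: each piece produced by a splitting rule. The rule assumed here is
# "a run of word characters, or a single punctuation mark".
tokens = re.findall(r"\w+|[^\w\s]", corpus)

# Lexicon: words paired with their meanings (hypothetical glosses).
lexicon = {
    "cat": "a small domesticated feline",
    "sat": "past tense of sit",
    "slept": "past tense of sleep",
}

# Tokens that have an entry in the lexicon.
known = [token for token in tokens if token.lower() in lexicon]

print(tokens)
print(known)
```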

Tokenizing means splitting the body of the text into sentences and words.


[Figure: Tokenizing text using NLTK in Python for Data Science (PST Analytics)]

In this example, the text is first split into sentence tokens, and the sentences are then split into word tokens.
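A minimal sketch of this two-step tokenization, assuming NLTK's `sent_tokenize` and `word_tokenize` helpers, which rely on the downloaded Punkt data. If that data is missing, the sketch falls back to an untrained `PunktSentenceTokenizer` and the rule-based `TreebankWordTokenizer`, which need no download but are less accurate around abbreviations. The sample text and the helper names `split_sentences` and `split_words` are made up for illustration.

```python
from nltk.tokenize import (
    PunktSentenceTokenizer,
    TreebankWordTokenizer,
    sent_tokenize,
    word_tokenize,
)


def split_sentences(text):
    """Split a body of text into sentence tokens."""
    try:
        # Usual route: uses the pre-trained Punkt data fetched by nltk.download().
        return sent_tokenize(text)
    except LookupError:
        # Data not installed: fall back to an untrained Punkt tokenizer,
        # which knows no abbreviations and is therefore less accurate.
        return PunktSentenceTokenizer().tokenize(text)


def split_words(sentence):
    """Split one sentence into word and punctuation tokens."""
    try:
        return word_tokenize(sentence)
    except LookupError:
        # Rule-based Treebank tokenizer; needs no downloaded data.
        return TreebankWordTokenizer().tokenize(sentence)


text = "Hello there. How are you today? NLTK makes tokenizing easy."

# Step 1: sentences first.
sentences = split_sentences(text)

# Step 2: then the words of each sentence.
words_per_sentence = [split_words(sentence) for sentence in sentences]

for sentence, words in zip(sentences, words_per_sentence):
    print(sentence, "->", words)
```

Punctuation marks come out as separate tokens, so the question mark in the second sentence is a token of its own.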

