Natural language processing (NLP) is a field of computer science and artificial intelligence concerned with the interaction between computers and human (natural) languages. It is used mainly for processing and analyzing large amounts of natural language data. It is closely related to machine learning and centers on analyzing text for tasks such as predictive analysis, often implemented in Python.
Now we will walk through the steps involved in text processing and the flow of an NLP pipeline.
Step 1: First, we will import the dataset, setting the delimiter to '\t' because the columns are separated by tab characters: each review and its category (0 or 1) are separated only by a tab. Other symbols may appear inside the reviews themselves, so using one of them as a delimiter could split rows incorrectly and lead to strange behavior in the output.
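A minimal sketch of this import with pandas. In the tutorial the data would come from a .tsv file, e.g. `pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)`; here an in-memory string (with illustrative reviews) stands in for the file so the snippet is runnable:

```python
import io
import pandas as pd

# Two illustrative tab-separated rows standing in for the .tsv file.
raw = ("Review\tLiked\n"
       "Wow... Loved this place.\t1\n"
       "Not tasty and the texture was just nasty.\t0\n")

# delimiter='\t' because reviews may contain commas and other symbols;
# quoting=3 (csv.QUOTE_NONE) stops quote marks inside reviews from
# confusing the parser.
dataset = pd.read_csv(io.StringIO(raw), delimiter='\t', quoting=3)
print(dataset.shape)  # (2, 2)
```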
Step 2: Text cleaning or Pre-processing
- Removal of punctuation and numbers: punctuation and numbers do not help much in processing the text; if included, they would inflate the bag of words we create at a later stage and decrease the efficiency of the algorithm.
- Stemming: reducing each word to its root (e.g. "loved" becomes "love").
- Converting each word to lower case: it is not useful to treat the same word in different cases as distinct words.
Example: after these steps, a review such as "Wow... Loved this place." becomes "wow love place".
Step 3: Now, we will perform tokenization: splitting the body of the text into sentences and individual words.
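The cleaning and tokenization steps above can be sketched as follows. NLTK's PorterStemmer is the usual choice for stemming; the stopword set here is a tiny hard-coded stand-in so the snippet runs without downloading NLTK corpora:

```python
import re
from nltk.stem.porter import PorterStemmer

# Tiny illustrative stopword set (the real pipeline would use
# nltk.corpus.stopwords.words('english')).
stop_words = {'this', 'the', 'was', 'and', 'is', 'a'}
ps = PorterStemmer()

def clean(review):
    review = re.sub('[^a-zA-Z]', ' ', review)  # drop punctuation and numbers
    review = review.lower()                    # normalise case
    tokens = review.split()                    # tokenization: split into words
    tokens = [ps.stem(w) for w in tokens if w not in stop_words]
    return ' '.join(tokens)

print(clean('Wow... Loved this place!'))  # wow love place
```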
Step 4: Making a bag of words using a sparse matrix
- We take every distinct word used in the reviews of the dataset, without repeating words.
- Each word is assigned one column, so there will be many columns.
- Each row corresponds to one review.
- If a word is present in a review, its count appears in that review's row under the word's column.
Example: consider a dataset containing only two reviews, say "good food good service" and "bad food". The bag of words has one column per distinct word (bad, food, good, service), so the first review's row reads 0, 1, 2, 1 and the second's reads 1, 1, 0, 0.
For this purpose we need the CountVectorizer class from sklearn.feature_extraction.text.
We can set the maximum number of features via max_features. We train on the corpus and apply the transformation with .fit_transform(corpus), then convert the result into an array. Whether a review is positive or negative is stored in the second column of the dataset, selected with dataset.iloc[:, 1]: all rows, column index 1.
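A minimal sketch on a toy two-review corpus (the real corpus would hold all the cleaned reviews, and the labels would come from the dataset's second column as described above):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the full list of cleaned reviews.
corpus = ['good food good service', 'bad food']

# max_features caps the vocabulary at the most frequent words; 1500 is a
# common choice for this kind of dataset, far larger than this toy vocabulary.
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()

print(sorted(cv.vocabulary_))  # ['bad', 'food', 'good', 'service']
print(X[0].tolist())           # [0, 1, 2, 1]
# Labels would then be taken as: y = dataset.iloc[:, 1].values
```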
Step 5: Here we will split the corpus into training and test sets. For this we need the train_test_split function from sklearn.model_selection (sklearn.cross_validation in older versions). Common splits are 70/30, 80/20, 75/25 or 85/15; we have chosen 75/25 here via the test_size parameter. X is the bag of words and y holds the 0/1 labels.
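A sketch of the split, with toy arrays standing in for the bag-of-words matrix and the labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins: X would be the bag-of-words array, y the 0/1 labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size=0.25 gives the 75/25 split chosen above; random_state fixes
# the shuffle so the split is reproducible (an illustrative choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)  # (15, 2) (5, 2)
```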
Step 6: Fitting a predictive model (we use a random forest here).
- As random forest is an ensemble model, we import the RandomForestClassifier class from sklearn.ensemble.
- We set it to 501 trees via n_estimators and use 'entropy' as the criterion.
- We fit the model with the .fit() method, passing X_train and y_train as arguments.
Step 7: Predicting the final result using the .predict() method with X_test as the argument.
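Steps 6 and 7 can be sketched as follows, with random toy data standing in for the train/test split above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the bag-of-words train/test split.
rng = np.random.default_rng(0)
X_train = rng.random((20, 4))
y_train = np.array([0, 1] * 10)
X_test = rng.random((5, 4))

# 501 trees (n_estimators) and 'entropy' as the split criterion, as above;
# random_state is an illustrative addition for reproducibility.
model = RandomForestClassifier(n_estimators=501, criterion='entropy',
                               random_state=0)
model.fit(X_train, y_train)

# Step 7: predict on the held-out set.
y_pred = model.predict(X_test)
print(y_pred.shape)  # (5,)
```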
Step 8: To measure the accuracy, a confusion matrix is required.
For binary classification, the confusion matrix is a 2 x 2 matrix counting true negatives, false positives, false negatives, and true positives.
Note: "true" or "false" refers to whether the assigned classification is correct; "positive" or "negative" refers to assignment to the positive or negative category.
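A sketch of Step 8 with sklearn's confusion_matrix; the labels here are illustrative stand-ins for y_test and y_pred from the steps above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative stand-ins for the true and predicted labels.
y_test = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, y_pred)
print(cm.tolist())                     # [[2, 1], [1, 2]]
print(accuracy_score(y_test, y_pred))  # 4 correct out of 6
```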