Books and documents content analysis in Python for Data Science:
Written texts do not follow the same patterns across languages and authors. Linguists can use these differences to identify a text's language of origin and its likely author.
Here we will examine the properties of individual books in a collection that spans various authors and languages. We will look at the length of each book, count its unique words, and see how these properties differ across authors and languages.
We can download the books from Project Gutenberg, which houses over 50,000 books from various authors and languages. We will download some English and foreign-language books for analysis, put them in a books folder, and create a subfolder for each language.
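The folder layout described above can be walked with a short helper. This is a minimal sketch, assuming the layout is books/<language>/.../<title>.txt; the function name list_books is hypothetical and not part of the original tutorial.

```python
from pathlib import Path


def list_books(book_dir):
    """Yield (language, path) pairs for every .txt file under book_dir.

    Assumption: the first folder below book_dir names the language,
    e.g. books/English/some_title.txt.
    """
    root = Path(book_dir)
    for path in sorted(root.rglob("*.txt")):
        parts = path.relative_to(root).parts
        if len(parts) < 2:
            # A .txt file sitting directly in book_dir has no language folder; skip it.
            continue
        yield parts[0], path
```

Calling list(list_books("books")) would then give one entry per downloaded book, tagged with the name of its language folder.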
Here we will build a function that counts the frequency of each word in a text. First we will try it on a sample string, and later we will replace the sample with the text of the books we have. We convert the text to lowercase first, so that the same word written in different cases is counted as one word.
We will look at two ways of counting word frequency: first with a for loop, and then with Counter from the collections module. The second is usually faster and is the one generally used. Either way, the function returns a dictionary that maps each unique word to its frequency.
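Both approaches might look like the sketch below. The names count_words, count_words_fast and the helper _normalize are illustrative, and stripping punctuation before splitting is an added assumption so that "test." and "test" count as the same word.

```python
import string
from collections import Counter


def _normalize(text):
    """Lowercase the text and drop ASCII punctuation (an assumption, see above)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def count_words(text):
    """Count word frequencies with an explicit for loop."""
    word_counts = {}
    for word in _normalize(text).split():
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
    return word_counts


def count_words_fast(text):
    """Count word frequencies with collections.Counter (a dict subclass)."""
    return Counter(_normalize(text).split())


sample = "This is a test. This test is simple."
print(count_words(sample))
print(count_words_fast(sample))
```

The two functions produce the same word-to-frequency mapping; Counter simply does the bookkeeping of the loop internally.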
Reading books in Python: Now we will build a function read_book() that reads a book and returns its contents as one long string. Its parameter is the location of the book's .txt file, which is passed in when the function is called.
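A minimal sketch of read_book() follows, assuming the files are UTF-8 encoded. Line breaks are replaced with spaces here (a choice made for this sketch) so that the last word of one line does not fuse with the first word of the next.

```python
def read_book(title_path):
    """Read a book from title_path and return it as one long string."""
    # Assumption: the downloaded text files are UTF-8 encoded.
    with open(title_path, "r", encoding="utf8") as current_file:
        text = current_file.read()
    # Replace line breaks with spaces so the text becomes a single line.
    return text.replace("\n", " ").replace("\r", " ")
```

The returned string can be passed directly to a word-counting function.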
Unique words: Here we will look at a function word_stats() that takes the word frequency dictionary as a parameter. It returns a tuple containing the number of unique words and a dict_values object holding the count of each word; summing those counts gives the total number of words in the text.
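As described, word_stats() could be sketched like this; the example dictionary is a made-up input used only to show the shape of the result.

```python
def word_stats(word_counts):
    """Return (number of unique words, dict_values of per-word counts)."""
    num_unique = len(word_counts)   # one key per unique word
    counts = word_counts.values()   # frequency of each word
    return (num_unique, counts)


# Made-up frequency dictionary, for illustration only.
num_unique, counts = word_stats({"to": 2, "be": 2, "or": 1, "not": 1})
print(num_unique, sum(counts))
```

Keeping the raw counts rather than only their sum leaves room for further statistics, such as finding the most frequent words.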
Calling the functions: Now we will read a book and then collect the required information on word frequency, word count, etc.
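The pieces can be chained as in the sketch below. So that the block runs on its own, it includes condensed stand-ins for read_book(), the Counter-based word counter and word_stats(), adds a hypothetical helper summarize_book(), and uses a tiny temporary file in place of a downloaded book. Punctuation stripping and UTF-8 encoding are assumptions.

```python
import string
import tempfile
from collections import Counter
from pathlib import Path


def read_book(title_path):
    # Condensed stand-in: read the file and flatten line breaks to spaces.
    with open(title_path, "r", encoding="utf8") as current_file:
        return current_file.read().replace("\n", " ").replace("\r", " ")


def count_words_fast(text):
    # Condensed stand-in: lowercase, strip ASCII punctuation, count words.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


def word_stats(word_counts):
    return (len(word_counts), word_counts.values())


def summarize_book(title_path):
    """Hypothetical helper: return (unique words, total words) for one book."""
    text = read_book(title_path)
    word_counts = count_words_fast(text)
    num_unique, counts = word_stats(word_counts)
    return num_unique, sum(counts)


# A two-line stand-in file; with real data, pass the path of a downloaded book.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("To be, or not to be:\nthat is the question.", encoding="utf8")
    num_unique, total = summarize_book(sample_path)
    print(num_unique, total)
```

For the sample sentence this reports 8 unique words out of 10 in total.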
Plotting characteristic features in books:
Here we will use matplotlib to plot book length against the number of unique words for the books we have downloaded. We also import the pandas library to create a DataFrame that holds the information on each book in its columns. Book length goes on the x-axis and the number of unique words on the y-axis.
In the first plot, every book is one point, regardless of its language or author. The log plot call draws the discrete points (red dots), and the linear plot call draws the line (blue) that joins the red dots.
The second plot uses logarithmic axes and shows the books as discrete points, with a different color for each language.
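Both plots might be produced along the lines of the sketch below. The rows in the DataFrame are placeholder numbers invented for illustration, not measurements from real books, and the column names, function names and colors are assumptions. The Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to open plot windows
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder rows (language, author, title, length, unique), NOT real book data.
rows = [
    ("English", "author_a", "title_1", 30000, 5000),
    ("English", "author_b", "title_2", 90000, 9000),
    ("French", "author_c", "title_3", 45000, 8000),
    ("French", "author_d", "title_4", 120000, 14000),
]
stats = pd.DataFrame(rows, columns=["language", "author", "title", "length", "unique"])


def plot_all_books(stats):
    """First plot: every book as a red dot, with a blue line joining the dots."""
    fig, ax = plt.subplots()
    ordered = stats.sort_values("length")  # sorted so the line runs left to right
    ax.plot(ordered["length"], ordered["unique"], "b-")
    # loglog draws the red dots and switches both axes to a log scale,
    # so the blue line above is displayed on log axes as well.
    ax.loglog(ordered["length"], ordered["unique"], "ro")
    ax.set_xlabel("Book length")
    ax.set_ylabel("Number of unique words")
    return fig, ax


def plot_by_language(stats, colors):
    """Second plot: log-log scatter with one color per language."""
    fig, ax = plt.subplots(figsize=(10, 10))
    for language, color in colors.items():
        subset = stats[stats["language"] == language]
        ax.loglog(subset["length"], subset["unique"], "o", color=color, label=language)
    ax.legend()
    ax.set_xlabel("Book length")
    ax.set_ylabel("Number of unique words")
    return fig, ax


fig1, ax1 = plot_all_books(stats)
fig2, ax2 = plot_by_language(stats, {"English": "crimson", "French": "forestgreen"})
```

To view or keep the figures, call plt.show() with an interactive backend or fig1.savefig(...) with a file name of your choice.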