Text analytics


Data science is commonly viewed in the numerical realm, but this growing field can also be applied to non-numerical data, such as text. This tutorial explores some key algorithms for making sense of text data, including basic text analytics, Markov chains, and sentiment analysis.

Introduction to data science, Part 1: “Data, structure, and the data science pipeline” explores the various types of data and shows how to extract value from them. But not all data is structured and in a form that makes it easy to manipulate. Some data, such as text, is unstructured and requires different mechanisms to extract insights. Text analytics, or text data mining, is the process of deriving information from text using a variety of methods. This tutorial explores some basic techniques, with a look at more advanced approaches using the Natural Language Toolkit (NLTK).

Install the NLTK

To use NLTK, you need Python V2.7, 3.4, or 3.5. With one of those Python versions installed, simply perform the steps in Listing 1 to install NLTK. These instructions use pip, the Python package manager.

Listing 01. Installing NLTK

$ sudo pip install -U nltk
$ sudo pip install -U requests

To verify that you’ve installed NLTK correctly, try to import NLTK interactively through Python, as shown below. If the imports fail, there’s an installation issue.

Listing 02. Testing the installation

$ python
Python 2.7.12+ (default, Sep 17 2016, 12:08:02) 
[GCC 6.2.0 20160914] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> import requests

With NLTK installed, you can now follow along with the examples in the next three sections. You can find these examples on GitHub.

Basic text analytics

The NLTK provides a range of capabilities, but all of them require ingesting text to perform any kind of analytics. Let’s start with a look at text ingest and some simple analytics.

Listing 3 provides a simple example of ingesting a sample corpus and tokenizing it in two forms: sentences and words. I use the Python requests library to read text from Charles Darwin’s On the Origin of Species from Project Gutenberg. I then apply two tokenizers to the text response (where a tokenizer breaks a string into substrings based on a boundary). In the first example, I use sent_tokenize to break the text into individual sentences (using a period, or full stop, [.] as a boundary). In the second example, I use word_tokenize to break the text into individual words (based on spaces and punctuation). The lengths of the sentence and word lists are emitted along with a random sentence and word (chosen as a function of the total size).

Listing 03. Tokenization of a sample corpus (tokens.py)

import nltk
import requests
import random
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

#Read the "Origin of Species"
r = requests.get("https://www.gutenberg.org/cache/epub/1228/pg1228.txt");

#Tokenize sentences from the text 
sent = sent_tokenize( r.text )

print len( sent )
print sent[ random.randint( 0, len(sent)-1 ) ]

#Tokenize words from the text
words = word_tokenize( r.text )

print len( words )
print words[ random.randint( 0, len(words)-1 ) ]

The output from the Python script above is shown below (only the randomly selected sentence is reproduced here). Running the script also reports that the sample text consists of 4,901 sentences and 179,661 words.

Listing 04. Output of the tokenization script from Listing 3

$ python tokens.py 
But there is no obvious reason why, for instance, the wing of a bat, or
the fin of a porpoise, should not have been sketched out with all the
parts in proper proportion, as soon as any structure became visible in
the embryo.

Tokenization is a typical first step in processing a collection of text. In the next example, I introduce a built-in capability of the NLTK for identifying the most common words and symbols within a corpus. The NLTK includes a frequency distribution class called FreqDist that identifies the frequency of each token found in the text (word or punctuation). These tokens are stored as tuples that include the word and the number of times it occurred in the text. In this example, I use text that is resident within the NLTK (text7 refers to The Wall Street Journal corpus). I invoke the most_common function to limit the output to the top eight tuples and then print this set. Listing 5 shows the code.
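Under the hood, a frequency distribution is essentially a counting dictionary. The same idea can be sketched with Python’s standard library collections.Counter before turning to the NLTK version (the toy token list here is my own stand-in for text7):

```python
from collections import Counter

# A toy token list standing in for a real corpus such as text7.
tokens = ["the", "cat", "sat", "on", "the", "mat", ",", "the", "dog", "slept", "."]

# Counter plays the role of FreqDist: it maps each token to its count.
fd = Counter(tokens)

# most_common(n) returns (token, count) tuples sorted by frequency,
# just as FreqDist.most_common does in the NLTK.
print(fd.most_common(2))  # [('the', 3), ('cat', 1)]
```

In fact, the NLTK’s FreqDist subclasses this behavior, which is why its most_common call in Listing 5 looks identical.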

Listing 05. Finding the most common words and symbols (most_common.py)

import nltk
import requests
from nltk import FreqDist
from nltk.book import *

fd = FreqDist(text7)

mc = fd.most_common(8)

print mc

The output from Listing 5 is shown in Listing 6. From this, you can see the import of the sample texts into Python; the last two lines represent the output of the most common tokens. Not surprisingly, punctuation like the comma (,) and period are high on the list, as is the definite article the.

Listing 06. Output of the most_common.py script from Listing 5

$ python most_common.py 
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
[(u',', 4885), (u'the', 4045), (u'.', 3828), (u'of', 2319), (u'to', 2164),
 (u'a', 1878), (u'in', 1572), (u'and', 1511)]

This last example of basic analytics looks at tagging. The NLTK includes a part-of-speech tagger that allows you to break a sentence into its lexical categories (noun, verb, etc.). This feature forms the basis of natural language processing by breaking language down into its constituent parts. You can use this functionality to develop question-answering applications and to perform sentiment analysis.

Listing 7 provides a simple script that uses the NLTK’s POS tagger. I read On the Origin of Species from Project Gutenberg and pick a random sentence. The sentence is tokenized into words, which are then tagged by a call to pos_tag. The script emits the result, a list of (word, tag) tuples.

Listing 07. POS tagging example

import nltk
import requests
import random
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk import pos_tag

#Read the "Origin of Species"
r = requests.get("https://www.gutenberg.org/cache/epub/1228/pg1228.txt");

#Tokenize sentences from the text 
sents = sent_tokenize( r.text )

sent = sents[ random.randint( 0, len(sents)-1 ) ]
print sent

words = word_tokenize( sent )

tagged = pos_tag( words )

print tagged

Listing 8 provides the output for the POS tagging example. The untagged sentence is emitted first, followed by the set of word/tag tuples. Note that the tags can be quite diverse, but in this example, they are personal pronoun (PRP), past-tense verb (VBD), preposition or conjunction (IN), determiner (DT), adverb (RB), adjective (JJ), noun-common-plural (NNS), and noun-proper-singular (NNP). You can view the entire set of tags in the NLTK through a call to nltk.help.upenn_tagset().

Listing 08. Output of the tagging script from Listing 7

$ python tagging.py 
He experimentised on some of the very same species as did Gartner.
[(u'He', 'PRP'), (u'experimentised', 'VBD'), (u'on', 'IN'), (u'some', 'DT'),
 (u'of', 'IN'), (u'the', 'DT'), (u'very', 'RB'), (u'same', 'JJ'),
 (u'species', 'NNS'), (u'as', 'IN'), (u'did', 'VBD'), (u'Gartner', 'NNP'),
 (u'.', '.')]

Next, I dive into two more complex examples and their applications: Markov chains and sentiment analysis.

Markov chains

A Markov chain is a stochastic model that describes a sequence of events in which the probability of each event depends only on the state of the previous event. In other words, for a given state, you can predict the next state without knowing the full history. This sounds simple, but it’s surprisingly useful in a variety of applications, including speech recognition, reinforcement learning, prediction, DNA sequencing, and compression.
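Before applying the idea to text, the core mechanic can be sketched with a hypothetical two-state weather model (the states and transition probabilities below are invented purely for illustration):

```python
import random

random.seed(42)  # make the sampled walk reproducible for this example

# Hypothetical transition probabilities: P(next state | current state).
transitions = {
    "sunny": [("sunny", 0.8), ("rainy", 0.2)],
    "rainy": [("sunny", 0.4), ("rainy", 0.6)],
}

def next_state(current):
    """Sample the next state given only the current one (the Markov property)."""
    states, probs = zip(*transitions[current])
    return random.choices(states, weights=probs)[0]

# Walk the chain a few steps; each step depends only on the step before it.
state = "sunny"
chain = [state]
for _ in range(5):
    state = next_state(state)
    chain.append(state)
print(chain)
```

The text examples that follow work the same way, except that the states are words and the transition probabilities are learned by counting word pairs in a corpus.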

One type of Markov model is called an n-gram, which refers to a sequence of symbols (such as words in a sentence). A simple kind of n-gram is the bigram, which is an n-gram of size 2. If a bigram refers to a pair of adjacent symbols, a trigram refers to three adjacent symbols. Let’s look at how useful bigrams can be.

Consider the sample sentence, “I am Sam, Sam I am.” From this sentence (ignoring punctuation), you can generate five bigrams, starting with a word and including the next. The first word can be considered the current state; the second word represents the predicted next state (see the image below).

Figure 1. Sample bigram list and graph
Table showing a word followed by the next word in the series

So, if you begin at a random word and select the next word based on a probability (in this case, only Sam can lead to itself, with P(0.5) and I with P(0.5)), you can construct random sentences in the style of the sample corpus. Starting with am and limiting to three words, you could generate {'am', 'Sam', 'Sam'} or {'am', 'Sam', 'I'}. That’s not all that interesting, but now consider that you generate bigrams from an entire book. You would end up with thousands of bigrams and have the ability to generate more sensible sentences. Listing 9 shows two sample sentence constructions using bigrams from On the Origin of Species (as generated by the Python script in Listing 10).
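The bigram table in Figure 1 can be reproduced in a few lines of plain Python, without the NLTK (the variable names are mine; the corpus is the sample sentence from the text with punctuation removed):

```python
import random

random.seed(1)  # reproducible choice for the example

# The sample corpus, punctuation ignored.
words = "I am Sam Sam I am".split()

# Build the bigrams: each (current word, next word) pair.
bigrams = list(zip(words, words[1:]))
print(bigrams)  # five bigrams: ('I', 'am'), ('am', 'Sam'), ...

# Collect the possible successors of each word; duplicates preserve probability.
successors = {}
for cur, nxt in bigrams:
    successors.setdefault(cur, []).append(nxt)

# 'Sam' is followed once by 'Sam' and once by 'I', so each has P(0.5).
print(successors["Sam"])  # ['Sam', 'I']

# Generate a three-word sequence starting from 'am'.
word, sentence = "am", ["am"]
for _ in range(2):
    word = random.choice(successors[word])
    sentence.append(word)
print(sentence)
```

Because the only successor of am is Sam, and Sam leads to Sam or I with equal probability, the generated sequence is always one of the two three-word results described above.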

Listing 09. Generated sentences from On the Origin of Species

$ python markov.py 
[u'I', u'need', u'not', u'support', u'my', u'theory', u'.']
$ python markov.py 
[u'I', u'think', u',', u'becoming', u'a', u'valid', u'argument', u'against',
 u'the', u'breeds', u'of', u'pigeons', u'.']

I’ll split the markov.py script in Listing 10 into three parts. The first part is the CreateTuples function, which accepts an array of words and breaks it down into bigram tuples. These bigrams are appended to a list and returned.

Part 2 is the sentence generator, which accepts a conditional frequency distribution (that is, a list of words and the words that follow each one, with their counts). I start with an initial word (in this case, I) and begin my sentence. I then iterate the CFD to generate the next word until that word is a period (representing the end of the sentence). I iterate the distribution of words that follow my target word and pick one randomly. This word is appended to my sentence, and the selected word becomes the new target word (and the process repeats to find a subsequent word). The constructed sentence is then emitted.

In part 3, I perform my setup processing. I read On the Origin of Species from the Project Gutenberg website, tokenize the text into an array of words, then pass this array to CreateTuples to generate my bigrams. I use the NLTK’s nltk.ConditionalFreqDist to construct the CFD, and then pass this CFD to EmitSentence to generate a random sentence by using the generated bigrams as a probabilistic guide. Some of the sentences generated from the corpus are enlightening, but many can be long and nonsensical. Scaling my example from bigrams to trigrams increases the odds of meaningful sentences.

Listing 10. Sentence construction with the NLTK and bigrams (markov.py)

import nltk
import requests
import random
from nltk.tokenize import word_tokenize

def CreateTuples( words ):
   tuples = []
   for i in range( len(words)-1 ):
      tuples.append( (words[i], words[i+1]) )

   return tuples

#Iterate the bigrams and construct a sentence until a '.' is encountered.
def EmitSentence( cfdist ):
   word = u'I'
   sentence = []
   sentence.append( word )

   while word != '.':
      options = []
      for gram in cfdist[word]:
         for result in range( cfdist[word][gram] ):
            options.append( gram )

      word = options[ int( len(options)*random.random() ) ]
      sentence.append( word )

   print sentence

#Read the "Origin of Species"
r = requests.get("https://www.gutenberg.org/cache/epub/1228/pg1228.txt");

#Tokenize words from the text 
tokens = word_tokenize( r.text )

#Create the bigram word tuples
tuples = CreateTuples( tokens )

#Create a conditional frequency distribution based upon the tuples
cfdist = nltk.ConditionalFreqDist( tuples )

#Emit a random sentence based upon the corpus
EmitSentence( cfdist )

This approach has useful applications. At the level of letters, you can identify word misspellings and offer corrections. You can also use bigrams or trigrams to identify the author of a given work, given hidden signatures within the text (word choice, frequency, constructions).
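As a sketch of the letter-level idea, the following compares letter bigrams using a Jaccard overlap to pick the closest dictionary word for a misspelling. The mini-dictionary and the similarity measure are illustrative choices of mine, not an NLTK spell-correction API:

```python
def letter_bigrams(word):
    """Return the set of adjacent letter pairs in a word."""
    return {word[i:i+2] for i in range(len(word) - 1)}

def similarity(a, b):
    """Jaccard overlap of letter bigrams: 1.0 for identical sets, 0.0 for disjoint."""
    ba, bb = letter_bigrams(a), letter_bigrams(b)
    return len(ba & bb) / len(ba | bb)

# A hypothetical mini-dictionary; the misspelling shares most of its
# letter bigrams with the intended word, so that word scores highest.
dictionary = ["species", "origin", "natural"]
misspelled = "speceis"
best = max(dictionary, key=lambda w: similarity(misspelled, w))
print(best)  # species
```

A production spell checker would use a larger n, weight by corpus frequency, and consider edit distance, but the bigram overlap alone already ranks the right candidate first here.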

Sentiment analysis

In my final example, I touch on the area of sentiment analysis. Sentiment analysis, or opinion mining, is the process of computationally identifying whether the attitude expressed in a piece of text is positive, negative, or neutral. This capability is useful, for example, in mining natural language reviews for opinions on a product or service.

The NLTK includes a simple rule-based model for sentiment analysis that combines lexical features to identify sentiment intensity. It’s also simple to use. Listing 11 shows a sample script that uses the NLTK sentiment analyzer. I import the necessary modules (including the VADER sentiment analyzer) and create a function that accepts a sentence and emits the sentiment classes. The function begins by instantiating the SentimentIntensityAnalyzer, then calls the polarity_scores method with the passed sentence. The result is a set of floats representing positive or negative valence for the input text. These floats are emitted for the four classes (negative; neutral; positive; and compound, which represents an aggregated score). The script ends by calling the function on the sentence passed as a command-line argument.

Listing 11. Sentiment analysis with NLTK (sentiment.py)

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys

def IdentifySentiment( sentence ):

   sia = SentimentIntensityAnalyzer()
   ps = sia.polarity_scores( sentence )

   for i in ps:
      print('{0}: {1}, '.format(i, ps[i]))

IdentifySentiment( sys.argv[1] )

Now take a look at how the NLTK’s sentiment analyzer works on a set of sentences. Listing 12 shows some interactions with the script in Listing 11, along with the output from the sentiment analyzer. The first sentence is obviously positive and the second obviously negative; both are properly classified. The third example illustrates how sarcasm can be problematic for analysis (although the analyzer did detect both positive and negative components of the sentence).

Listing 12. Sample output from the sentiment analyzer script in Listing 11

$ python sentiment.py "The meat was wonderfully seasoned and prepared."
neg: 0.0, 
neu: 0.463, 
pos: 0.537, 
compound: 0.7003,
$ python sentiment.py "The stench in the air was something to behold."
neg: 0.292, 
neu: 0.708, 
pos: 0.0, 
compound: ‑0.5106,
$ python sentiment.py "I love how you don't care how you come across."
neg: 0.19, 
neu: 0.506, 
pos: 0.304, 
compound: 0.3761,
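
To collapse these scores into a single label, the VADER authors suggest thresholding the compound score at roughly plus or minus 0.05. The helper below is a hypothetical convenience of mine, not part of the NLTK, applied to the compound scores from the runs above:

```python
def classify(compound, threshold=0.05):
    """Map a compound score to a label using the conventional +/-0.05 cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores reported for the three sentences above.
for score in (0.7003, -0.5106, 0.3761):
    print(classify(score))  # positive, negative, positive
```

Note that under this rule the sarcastic third sentence is labeled positive, which matches the analyzer’s literal reading rather than the speaker’s intent.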

Sentiment analysis has many applications, and given the growth of online content (blogs, tweets, and other information), the approach can be successfully used to mine opinions for real-time feedback.

Going further

Unstructured data can be difficult to use in the context of machine learning. Couple this with its preponderance in the wild (textual content online), and it’s easy to see why text is the new frontier for data mining. Using Python and the NLTK, it’s easy to see how you can use simple scripts for text analytics. The final tutorial in this series reviews the languages of data science, both for numerical and textual understanding.