In 1954, Georgetown University and IBM demonstrated the machine translation of more than 60 Russian sentences into English. More than half a century later, IBM developed a question-answering system called IBM Watson that defeated even the best Jeopardy players. Natural language processing (NLP), a hallmark of IBM Watson, remains one of the most important applications of machine learning because it represents the most natural interface between human and machine. This tutorial explores some of the primary methods that are used for NLP, including statistical methods and deep learning (neural networks). It also demonstrates NLP using open source libraries.
NLP is an umbrella over several language-related fields. To illustrate, consider the diagram in Figure 1. An abstract question-and-answer system includes the translation of audio into text; the breakdown of that text into a structure that machines can use; a discourse engine that maintains state and an interface to some knowledge source; the generation of an answer; and finally, synthesis back to speech audio.
You can break down these elements of an abstract Q&A system even further. In NLP, text is broken down by syntax into its composite parts, and then assigned some meaning through semantics. In natural language generation, the system must reconstruct an abstract concept (to be communicated to the user) in a response (given the rules of sentence structure for the wanted language).
But Q&A is only one aspect of NLP. NLP is applied in many areas outside discourse as well. Two examples are sentiment analysis (determining affect from a sentence or document) and summarization (where the system creates a summary from a body of text). Let’s take a quick survey of NLP’s history, and then dig into the details.
NLP was one of the earliest research targets for strong artificial intelligence because it’s the natural interface between humans and machines (Figure 2). The earliest implementation of NLP was part of the 1954 Georgetown experiment, a joint project between IBM and Georgetown University that successfully demonstrated the machine translation of more than 60 Russian sentences into English. The researchers accomplished this feat by using hand-coded language rules, but the system failed to scale to general translation.
The 1960s saw research into “micro-worlds,” which developed simulated worlds and NLP to query and manipulate objects in those worlds. One famous example was Terry Winograd’s SHRDLU, which used NLP to change the state of a virtual sandbox that contained shapes, and then query the state of the world through English (“Can a pyramid be supported by a block?”). SHRDLU demonstrated not only NLP but also planning to carry out requests such as “Clear the table, and place the red block on the blue block.” Other developments included the construction of ELIZA, a chatterbot that simulated a psychotherapist.
The 1970s brought new ideas into NLP, such as building conceptual ontologies (machine-usable data). This work continued into the 1980s, when researchers developed hardcoded rules and grammars to parse language. These methods proved to be brittle, but they were the most practical option given the computational resources available at the time.
It wasn’t until the late 1980s that statistical models came into play. Rather than complex rules that tended to be brittle, statistical models used existing textual corpora (documents and other information) to build models for how people use language. Researchers applied hidden Markov models (a method for creating probabilistic models for linear sequences) to part-of-speech tagging to disambiguate the meaning behind word choices in speech (given the many ambiguities that exist in language). Statistical models broke through the complexity barrier of hand-coded rules by creating them through automatic learning.
Today, deep learning has raised the bar in many NLP tasks. Recurrent neural networks (which differ from feed-forward networks in that they can be self-referential) have been successfully applied to parsing, sentiment analysis, and even natural language generation (in concert with image-recognition networks).
Parsing a sentence
Now, let’s look at the processing of an English sentence and the pipeline of tasks that are used to break it down. I’ll parse a simple sentence of four words (“The other boy runs.”), and then illustrate it in Figure 3.
The first step in parsing is to tokenize the sentence—that is, simply breaking down the sentence into its individual parts (or tokens). The tokens that make up my simple sentence are The, other, boy, runs, and the ending period (.). Tokenization yields the complete set of individual words that make up the sentence.
The next step is called stop word removal. The goal of stop word removal is to remove commonly used words in the language to permit focus on the important words in the sentence. There is no single definition of the stop word set, but there are common words that are easily removed.
After removing the stop words, I focus on removing punctuation. Punctuation in this context refers not only to commas and periods but also to the variety of special symbols used (parentheses, apostrophes, quotation marks, exclamation points, and so on).
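Now that I’ve cleaned up my sentence, I’ll focus on the process of lemmatization, which is related to (but not the same as) stemming. The goal of both is to reduce words to a base form: stemming strips suffixes by rule to produce a stem, while lemmatization uses vocabulary and word-class information to return the dictionary form, or lemma. For example, walking would be reduced to walk. Because it consults a vocabulary, a lemmatizer can also map an irregular form to its correct lemma (for example, changing better to good), which a suffix-stripping stemmer can’t do. In this example, I reduce runs to its root form, run.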
The final phase in the parse is called part-of-speech (POS) tagging. In this process, I mark up the words as they correspond to a part of speech based on their context. I identify my remaining word tokens that correspond to a determiner, a noun, and a verb.
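To make these steps concrete before turning to a real toolkit, here is a minimal sketch of the whole pipeline in plain Python. The stop word set, lemma table, and tag table are tiny hand-written stand-ins invented for this one sentence (they are assumptions, not data from any library), so the sketch shows only the order of the steps, not how a real parser decides anything.

```python
import string

# Hypothetical, hand-written resources that cover only this example sentence.
STOP_WORDS = {"the", "a", "an"}
LEMMAS = {"runs": "run", "walking": "walk", "better": "good"}
POS_TAGS = {"other": "determiner", "boy": "noun", "run": "verb"}
PUNCT = set(string.punctuation)


def tokenize(sentence):
    """Split on whitespace and peel trailing punctuation off as separate tokens."""
    tokens = []
    for chunk in sentence.split():
        word = chunk.rstrip(string.punctuation)
        if word:
            tokens.append(word)
        tokens.extend(chunk[len(word):])  # one token per trailing symbol
    return tokens


def parse(sentence):
    tokens = tokenize(sentence)                                     # 1. tokenize
    no_stops = [t for t in tokens if t.lower() not in STOP_WORDS]   # 2. stop words
    words = [t for t in no_stops if t not in PUNCT]                 # 3. punctuation
    lemmas = [LEMMAS.get(w.lower(), w.lower()) for w in words]      # 4. lemmatize
    return [(w, POS_TAGS.get(w, "unknown")) for w in lemmas]        # 5. POS tag


result = parse("The other boy runs.")
print(result)
```

A lookup table can’t use context, which is what real POS taggers rely on; the table here only marks where that step sits in the pipeline.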
Those weren’t complex steps, but now let’s look at how this process occurs when performed automatically.
Parsing with NLTK
One of the most popular platforms for NLP is the Natural Language Toolkit (NLTK) for Python. Using the sentence parsing pipeline in Figure 3, I’ll use NLTK on a slightly more complex sentence. In this example, I use the opening line from William Gibson’s Neuromancer. I loaded the line into a string within Python (note here that >>> is the Python prompt).
>>> sentence = "The sky above the port was the color of television, tuned to a dead channel."
I then tokenize the sentence using NLTK’s word tokenizer and emit the tokens.
>>> import nltk
>>> tokens = nltk.word_tokenize(sentence)
>>> print(tokens)
['The', 'sky', 'above', 'the', 'port', 'was', 'the', 'color', 'of', 'television', ',', 'tuned', 'to', 'a', 'dead', 'channel', '.']
With the sentence tokenized, I can remove the stop words by creating a set of stop words for English, and then filtering the tokens from this set.
>>> from nltk.corpus import stopwords
>>> stop_words = set(stopwords.words("english"))
>>> filtered = [word for word in tokens if word not in stop_words]
>>> print(filtered)
['The', 'sky', 'port', 'color', 'television', ',', 'tuned', 'dead', 'channel', '.']
Next, I remove any punctuation from my filtered list of tokens. I create a simple set of punctuation, and then filter the list one more time.
>>> punct = {",", "."}
>>> clean = [word for word in filtered if word not in punct]
>>> print(clean)
['The', 'sky', 'port', 'color', 'television', 'tuned', 'dead', 'channel']
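Notice that 'The' survives the stop word filter: the comparison is case-sensitive, and the stop word list is lowercase. One possible variation, sketched below, lowercases each token before the lookup and takes the punctuation set from Python’s string module instead of listing symbols by hand. To keep the sketch runnable without NLTK data, the stop word set is a small hand-picked stand-in (an assumption) for the much longer list that stopwords.words("english") returns.

```python
import string

tokens = ['The', 'sky', 'above', 'the', 'port', 'was', 'the', 'color', 'of',
          'television', ',', 'tuned', 'to', 'a', 'dead', 'channel', '.']

# Hypothetical stand-in for the NLTK English stop word list.
stop_words = {"the", "above", "was", "of", "to", "a"}

# All ASCII punctuation characters, so no symbol has to be listed by hand.
punct = set(string.punctuation)

# Compare the lowercased token so that sentence-initial 'The' is caught too.
clean = [t for t in tokens if t.lower() not in stop_words and t not in punct]
print(clean)
```

Whether to lowercase depends on the task: case can carry information (for example, for proper nouns), so some pipelines lowercase only for the lookup, as here, and keep the original token.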
Finally, I perform POS tagging to the cleansed list of tokens. The result is a set of token and tag pairs, with the tag indicating the word class.
>>> parsed = nltk.pos_tag(clean)
>>> print(parsed)
[('The', 'DT'), ('sky', 'NN'), ('port', 'NN'), ('color', 'NN'), ('television', 'NN'), ('tuned', 'VBN'), ('dead', 'JJ'), ('channel', 'NN')]
That’s a small subset of the capabilities of NLTK. With NLTK, you also have a collection of corpora that you can easily use to experiment with NLTK and its capabilities.
The problem with parsing text based on hand-coded grammars is that the rules can be quite brittle. But, rather than rely on brittle hand-coded rules, what if the rules could be learned from sample texts? This is where statistical methods come into play. One interesting advantage of statistical methods is that they can operate on previously unseen input or input that includes errors. Gracefully handling these kinds of issues (operating on new never-before-seen input) is not generally possible using grammars.
Let’s explore some of the statistical methods that NLTK provides. First, I’ll import a sample corpus that I’ll then use within NLTK.
>>> import nltk
>>> from nltk.book import *
I imported NLTK into Python, and then imported the nine sample texts that nltk.book provides. For this example, I use text4, which is the “Inaugural Address Corpus.”
>>> text4
<Text: Inaugural Address Corpus>
I can easily build the frequency distribution of a text by using the FreqDist class, which counts how often each word occurs in a corpus. Building a frequency distribution is a common task, and NLTK makes it easy. With the distribution created, I use the most_common method to emit the 15 most common words (with their number of appearances in the corpus). The len function gives the total number of words (and symbols) in the text; dividing the count for “the” by that total shows that this one word makes up more than 6% of the text.
>>> fdist = FreqDist(text4)
>>> fdist.most_common(15)
[('the', 9281), ('of', 6970), (',', 6840), ('and', 4991), ('.', 4676), ('to', 4311), ('in', 2527), ('a', 2134), ('our', 1985), ('that', 1688), ('be', 1460), ('is', 1403), ('we', 1141), ('for', 1075), ('by', 1036)]
>>> len(text4)
145735
>>> print(round(100.0 * 9281 / len(text4), 2))
6.37
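A frequency distribution is, at its core, a tally of tokens. The sketch below shows the same idea with Python’s standard collections.Counter on a short made-up token list, so it runs without NLTK or its corpora; the sample text and variable names are illustrative assumptions.

```python
from collections import Counter

# A tiny stand-in corpus; a real corpus would have thousands of tokens.
words = "the sky above the port was the color of television".split()

# Tally how often each token occurs.
fdist = Counter(words)

# The single most frequent token and its count.
print(fdist.most_common(1))

# Share of the corpus taken up by one word, as a percentage.
share = 100.0 * fdist["the"] / len(words)
print(share)
```

The percentage works the same way as in the NLTK session: the count for one word divided by the total number of tokens.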
NLTK easily identifies the most common contexts in which a word is used in a corpus. Using the common_contexts method, I can provide a list of words and find their contexts (the following example indicates that “cloaked” was found in the context “contempt cloaked in”).
>>> text4.common_contexts(["cloaked"])
contempt_in
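The idea behind a context can be sketched in a few lines: for every occurrence of the target word, record the word before it and the word after it. The function below is a simplified illustration for a single word, not NLTK’s implementation (which also handles lists of words), and the sample sentence is made up.

```python
from collections import Counter


def contexts(tokens, target):
    """Count (previous word, next word) pairs around each occurrence of target."""
    found = Counter()
    # Skip the first and last token, which lack a neighbor on one side.
    for i in range(1, len(tokens) - 1):
        if tokens[i].lower() == target:
            found[(tokens[i - 1].lower(), tokens[i + 1].lower())] += 1
    return found


# Made-up sample text for illustration only.
tokens = "with contempt cloaked in polite words".split()

for (left, right), count in contexts(tokens, "cloaked").most_common():
    print(f"{left}_{right}", count)
```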
Finally, let’s look at an important aspect of understanding a corpus: sequences of words that occur often. The concept of a bigram (that is, a pair of words that occur together in a text) comes into play here, but collocations are the subset of bigrams that occur unusually often. Using the collocations method, I can extract a set of word pairs that frequently occur together in a given text.
>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal Government; General Government; American people; Vice President; Old World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; every citizen; Indian tribes; public debt; one another; foreign nations; political parties
From this set of collocations, it’s easy to see that these word pairs depend strongly on the corpus. Consider another example using the Monty Python and the Holy Grail movie as the input corpus.
>>> text6.collocations()
BLACK KNIGHT; clop clop; HEAD KNIGHT; mumble mumble; Holy Grail; squeak squeak; FRENCH GUARD; saw saw; Sir Robin; Run away; CARTOON CHARACTER; King Arthur; Iesu domine; Pie Iesu; DEAD PERSON; Round Table; clap clap; OLD MAN; dramatic chord; dona eis
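What makes a bigram a collocation is that it occurs more often than the frequencies of its two words would predict. One common way to score this is pointwise mutual information (PMI); the sketch below uses PMI purely as an illustration and makes no claim about which measure NLTK applies internally. The function name, the min_count threshold, and the sample text are assumptions.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score each bigram seen at least min_count times by PMI (in bits)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable scores
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # How much more often the pair occurs than chance would predict.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Made-up sample text for illustration only.
tokens = ("the holy grail is the holy grail and the knight "
          "saw the old man and the old man saw the knight").split()

scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(pair), round(score, 2))
```

In this sample, “holy grail” scores higher than “the holy” even though both pairs occur twice, because “the” is common everywhere and so predicts little about its neighbor.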
Bigrams (or n-grams in general, where a bigram represents n=2) can be useful in understanding word usage and sequence within a text or to assess the probability of a word occurring (in the face of noise). This method has also been used to support identification of misspelled words or plagiarized text. N-grams have even been used to generate text as trained from the n-grams of a given corpus.
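Both uses mentioned here (estimating the probability of a word and generating text) fall out of a table of bigram counts. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the training text is a single made-up line, and no smoothing is applied, so unseen pairs get probability zero.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        model[current][following] += 1
    return model


def next_word_probability(model, current, following):
    """Estimate P(following | current) from raw counts (no smoothing)."""
    total = sum(model[current].values())
    return model[current][following] / total if total else 0.0


def generate(model, start, length, seed=0):
    """Random walk over the bigram table, weighted by observed counts."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        followers = model.get(words[-1])
        if not followers:
            break  # dead end: this word was never seen with a successor
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)


tokens = "the sky above the port was the color of television".split()
model = train_bigrams(tokens)

print(next_word_probability(model, "the", "sky"))
print(generate(model, "the", 6))
```

With a corpus this small the generated text mostly replays the input; the same code trained on a large corpus produces locally plausible but globally incoherent text, which is the typical behavior of low-order n-gram generators.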
Deep learning has become one of the most active areas of study in machine learning and has been applied to object recognition in images or video (such as faces) and even to summarizing images with natural language based on trained samples. Researchers have applied deep learning in the area of NLP, as well.
A key deep learning approach to NLP is the recurrent neural network, in which the network processes a sequence step by step over time or includes self-referential (feedback) connections. A key architecture is the long short-term memory (LSTM) network, in which memory cells serve as the processing units. Each memory cell includes an input and an input gate (which determines when new input can be incorporated); an output and an output gate (which determines when the output is fed forward); and a forget gate, which determines when the current memory cell state is forgotten to allow new input. This general architecture is shown in Figure 4, which represents a single layer of cells with input, output, and the ability to feed cell state through to other cells (representing time or sequence).
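The gates can be made concrete with a single scalar memory cell. The sketch below follows the commonly used LSTM update equations in plain Python; the weights are arbitrary illustrative values (an assumption, not trained parameters), and a real network would use vectors and matrices rather than single numbers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """Advance one scalar LSTM cell by one time step.

    params maps a gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, squash):
        w, u, b = params[name]
        return squash(w * x + u * h_prev + b)

    i = gate("input", sigmoid)         # how much new content to admit
    f = gate("forget", sigmoid)        # how much old cell state to keep
    o = gate("output", sigmoid)        # how much of the cell to expose
    g = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * g             # updated memory cell state
    h = o * math.tanh(c)               # output fed forward and to the next step
    return h, c


# Arbitrary illustrative weights, not the result of training.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.0, 0.0, 2.0),
    "output": (0.3, 0.2, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

In this formulation, a forget gate value near 1 keeps the old cell state and a value near 0 discards it, which is how the cell carries information across many time steps.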
You can train LSTMs on input (using supervised learning to modify the cells based on the error of the output) with the backpropagation-through-time algorithm, a type of backpropagation applicable to recurrent networks. Note that for NLP, words aren’t fed to the network as raw text but are instead represented numerically through word vectors, which map words into a high-dimensional space (1,000 dimensions, for example) in which dimensions can capture properties such as tense, singular versus plural, gender, and so on. Mapping words into word vectors makes it possible to perform math operations on words, with surprising results. One famous example is that the vector for Queen lies very close to the result of the operation King + Woman – Man. Logically, this makes sense, but it also works mathematically using word vectors.
The power of LSTMs is realized not in small networks but in vertically deep networks, which increase their memory, and in horizontally deep networks, which increase their representational capability. This is illustrated in Figure 5.
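The arithmetic can be tried on a toy example. The three-dimensional vectors below are invented by hand so that the dimensions loosely stand for royalty, masculinity, and femininity; real word vectors are learned from text and have hundreds of dimensions, so this is only a sketch of the mechanics (vector addition followed by a nearest-neighbor search using cosine similarity).

```python
import math

# Hand-made toy vectors: (royalty, masculinity, femininity). Purely illustrative.
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves, as is customary for this test.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


nearest = analogy(positive=["king", "woman"], negative=["man"])
print(nearest)
```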
The sizing of LSTM networks is based on the size of the vocabulary on which they must operate.
IBM Watson NLP APIs
The complexity of these algorithms for NLP can be significant, so it’s not surprising that you can now conduct NLP with IBM Watson through a set of APIs. The same APIs that helped defeat players on the television game show Jeopardy are available through a set of REST services.
The IBM Watson APIs expose a range of functionality, including a conversation API that you can use to add a natural language interface to an application and APIs that can classify or translate language. IBM Watson can even translate speech to text or text to speech through APIs.
I’ve discussed several applications for NLP here, but the applicability of NLP is actually quite large and diverse. Automated essay scoring is a useful application in the academic arena, along with grammar checking (including text simplification), which is part of many word processors. NLP within foreign language learning is also popular, helping students understand text in another language or checking human-generated text in another language (which includes automated language translation). Finally, natural language search and information retrieval are useful aspects of NLP, particularly considering the growth of multimedia data that can be mined.
NLP is also quite useful in text analytics (otherwise known as text mining), which computes features such as word frequencies, collocations (words that occur together unusually often), bigrams, and the distribution of word lengths. These features can be useful in determining textual complexity or even in signature analysis to identify the author.
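As a small illustration of one such feature, the sketch below computes the distribution of word lengths for a short made-up token list. The function name and sample text are assumptions; a real analysis would run over a full document and compare the resulting profiles across authors or reading levels.

```python
from collections import Counter


def word_length_distribution(tokens):
    """Return {word length: share of words}, ignoring non-alphabetic tokens."""
    lengths = Counter(len(t) for t in tokens if t.isalpha())
    total = sum(lengths.values())
    return {length: count / total for length, count in sorted(lengths.items())}


tokens = "The sky above the port was the color of television".split()
dist = word_length_distribution(tokens)

# Average word length, a crude proxy for textual complexity.
avg = sum(length * share for length, share in dist.items())

print(dist)
print(round(avg, 2))
```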
The future is deep learning
NLP has shown growth over the last 60 years from hand-coded grammars and rules, but the science has taken a significant technical leap through the use of deep learning. NLP represents the most natural interface for humans, so the journey to bring this model of communication to machines was necessary. Today, you can find NLP, along with speech recognition and synthesis, in everyday devices. Deep learning continues to evolve, and it will bring more new advances to this field.