This is the final part of a series that explores simple language statistics for AI. Part 1 covered letter correlation and simple language statistics, and Part 2 covered word analysis and N-grams in a variety of practical applications.

Going back to the metaphor from Part 1, imagine you had monkeys that use typewriters to produce text, and that the typewriters were rigged in such a way that the probability of a monkey striking a particular key is connected to the expected frequency of the N-gram it would produce. Just to give one example, if the typewriter were derived from English trigram frequencies and the monkey typed EQ, the probability that the monkey next types U should be much higher than any other option.
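
Python's random module can make exactly this kind of weighted selection. Here is a minimal sketch of the idea, using made-up counts for trigrams that begin with "eq" (the numbers are purely illustrative, not taken from any corpus):

import random

#Made-up counts of trigrams that start with "eq" -- for illustration only
completions = {'u': 980, 'a': 12, ' ': 8}

letters = list(completions)
weights = [completions[c] for c in letters]
#random.choices makes a weighted draw; "u" comes up about 98% of the time
print(random.choices(letters, weights=weights)[0])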

Capturing N-gram frequencies

We can simulate the monkey at a probabilistic typewriter in code through the power of random number generation utilities, which can select according to distributions. First, we need to capture the N-grams generated so far into a reusable form.

The following code is a variation on code from Part 1. It writes computed N-gram frequencies to a file.

Listing 1. store_letter_ngram_counts.py
import sys
import json
from nltk.probability import FreqDist

from nltk.util import ngrams
from nltk.tokenize import RegexpTokenizer

#Set up a tokenizer that captures only lowercase letters and spaces
#This requires that input has been preprocessed to lowercase all letters
TOKENIZER = RegexpTokenizer("[a-z ]")


def count_ngrams(input_fp, frequencies, order, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each N-gram (sequence of N characters) appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each N-gram to its counted frequency
        order -- The N in each N-gram (i.e. number of items)
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    #Loop over the file while there is text to read
    while text:
        #This step is needed to collapse runs of space characters into one
        text = ' '.join(text.split())
        spans = TOKENIZER.span_tokenize(text)
        tokens = (text[begin : end] for (begin, end) in spans)
        for ngram in ngrams(tokens, order):
            #Increment the count for the N-gram. Automatically handles any
            #N-gram not seen before. The join expression turns the separate
            #single-character strings into one N-character string
            if '  ' not in ''.join(ngram):
                frequencies[''.join(ngram)] += 1
        #Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()

    return


if __name__ == '__main__':
    #Initialize the mapping
    frequencies = FreqDist()
    #The order of the ngrams is the first command line argument
    ngram_order = int(sys.argv[1])
    #Pull the input data from the console
    count_ngrams(sys.stdin, frequencies, ngram_order)
    output_fp = open(sys.argv[2], 'w')
    json.dump(dict(frequencies), output_fp)
    print('Stored frequencies of {} encountered N-grams.'.format(len(frequencies)))

Run the program as follows:

python store_letter_ngram_counts.py 3 oanc-trigrams.json < oanc.txt

This results in a file, oanc-trigrams.json, with the trigram frequencies in JSON form. It prints to the console the number of distinct N-grams encountered, in this case:

Stored frequencies of 15833 encountered N-grams.

I also generated 5-gram and 7-gram data sets to use later:

python store_letter_ngram_counts.py 5 oanc-5grams.json < oanc.txt
python store_letter_ngram_counts.py 7 oanc-7grams.json < oanc.txt
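
If you want a peek at what was stored, you can load the JSON back into a dictionary and list the most frequent trigrams. A small sketch, assuming the oanc-trigrams.json file generated above:

import json

with open('oanc-trigrams.json') as f:
    frequencies = json.load(f)

#Show the ten most frequent trigrams and their counts
top = sorted(frequencies.items(), key=lambda kv: kv[1], reverse=True)[:10]
for ngram, count in top:
    print(repr(ngram), count)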

Rigging the typewriters

Now comes the fun part. We need a separate program to read the N-gram frequencies from the file and use them to generate text. It needs to do some additional computation to determine how likely one letter is to occur after a known sequence of other letters. For example, when working with trigrams, it needs to compute how often E follows TH compared to how often A, Z, or any other letters do. The logic involved can be a little confusing, so I thought I should start with a diagram that shows a few simplified steps in the process.

Simplified process for generating text from trigram frequencies

The diagram makes several simplifications. First, it assumes that the process has already generated the leading text “TH.” It just has to figure out which letter comes next. The diagram shows only three options per step, rather than the full 27 (letters plus space), and it illustrates with some mocked-up frequencies. If the training model found 100 instances of the trigram “THA,” 350 of “THE,” and 50 of “THR,” then to follow that model, the process would have a 70-percent chance of selecting “E” as the next letter, completing the trigram “THE.”
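
In code, those mocked-up counts translate directly into selection probabilities. A quick sketch using the invented frequencies from the diagram:

#Mocked-up trigram counts for the lead "TH", matching the diagram
completions = {'A': 100, 'E': 350, 'R': 50}
total = sum(completions.values())   #500

for letter, count in completions.items():
    #"E" works out to 350/500 = 0.7, the 70-percent chance described above
    print(letter, count / total)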

In the second step it chooses a space character, the 73-percent option. In the diagram, space characters are displayed as underscores for clarity.

Of course, it is a random selection in each step, and it won’t always pick the highest-probability item. In the third step, it picks the very low-probability “Q” to complete the trigram “T Q.” Pretend that it goes on like this to produce “THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.”

I hope you get a picture of what’s going on in this process of generating text.

Markov process

This generation of text is an example of what is called a Markov process or Markov chain. At each point, the context or lead (the most recent N-1 items generated; two, in the trigram case) makes up the current state. There are multiple possible transitions to the next state, one of which is selected with a probability based on the frequency of the N-gram it represents. At that point, the new state is the tail of the previous lead plus the new item generated.

The combination of state with random selection of transitions based on probability is a Markov process. It’s a very important concept in computer science, but I won’t be delving too deeply into the theoretical details in this tutorial. If you are interested, I recommend Setosa’s visual and interactive explainer.
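
To make the state-and-transition idea concrete before we get to the full program, here is a toy sketch of a Markov chain over two-character states. The transition table is invented for illustration and is not derived from any real corpus:

import random

#A toy transition table: each two-character state maps candidate next
#characters to invented counts (not real English statistics)
transitions = {
    'th': {'e': 350, 'a': 100, 'r': 50},
    'he': {' ': 300, 'r': 80, 'n': 20},
    'e ': {'t': 150, 'a': 120, 'q': 5},
    ' t': {'h': 200, 'o': 90},
    ' q': {'u': 99},
    'qu': {'i': 90, 'a': 10},
}

state = 'th'
output = state
for _ in range(20):
    options = transitions.get(state)
    if not options:
        break   #Dead end: no statistics for this state
    letters = list(options)
    weights = [options[c] for c in letters]
    nxt = random.choices(letters, weights=weights)[0]
    output += nxt
    #The new state drops the oldest character and appends the new one
    state = state[1:] + nxt
print(output)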

Getting the ball rolling

The basic process of text generation is easily explained, but there are many tricky areas that are very sensitive to the strategies you select in implementation. One such area is how you start the process. What is the initial state? In the trigram example illustrated previously, the initial state is arbitrarily chosen as “TH.” You could decide to always use the same initial state, ideally one with many possible trigram completions, but observers of the output will notice the fixed pattern at the start.

Another possible strategy is to select N-1 initial characters at random, based on the overall frequency of each letter. The following code uses this strategy to generate text. I’ve added liberal comments to help you follow along.

Listing 2. generate_letters_strategy1.py
import sys
import json
import string
import random

population = ' ' + string.ascii_lowercase

def preprocess_frequencies(frequencies, order):
    '''Compile simple mapping from N‑grams to frequencies into data structures to help compute
    the probability of state transitions to complete an N‑gram

    Arguments:
        frequencies ‑‑ mapping from N‑gram to frequency recorded in the training text
        order ‑‑ The N in each N‑gram (i.e. number of items)

    Returns:
        sequencer ‑‑ Set of mappings from each N‑1 sequence to the frequency of possible
            items completing it
        item_freqs ‑‑ individual frequency of each item in the training text
    '''
    sequencer = {}
    item_freqs = {}
    for ngram in frequencies:
        #Separate the N‑1 lead of each N‑gram from its item completions
        freq = frequencies[ngram]
        lead = ngram[:-1]
        final = ngram[-1]
        sequencer.setdefault(lead, {})
        sequencer[lead][final] = freq
        #Accumulate the overall frequency of each item
        for c in ngram:
            item_freqs.setdefault(c, 0)
            item_freqs[c] += freq
    return sequencer, item_freqs


def generate_letters(sequencer, item_freqs, length, order):
    '''Generate text based on probabilities derived from statistics for initializing
    and continuing sequences of letters

    Arguments:
        sequencer ‑‑ mapping from each leading sequence to frequencies of the next letter
        item_freqs ‑‑ mapping from each item to its overall frequency regardless of sequence
        length ‑‑ approximate number of characters to generate before ending the program
        order ‑‑ The N in each N‑gram (i.e. number of items)

    Returns:
        nothing
    '''
    #The lead is the initial part of the N‑Gram to be completed, of length N‑1
    #containing the last N‑1 items produced
    lead = ''
    #Keep track of how many items have been generated
    generated_count = 0
    #Turn item frequencies into weights for selection probability
    item_weights = [item_freqs.get(c, 0) for c in population]
    while generated_count < length:
        #This condition will be true until the initial lead N‑gram is constructed
        #It will also be true if we get to a dead end where there are no stats
        #For the next item from the current lead
        if lead not in sequencer:
            #Returns a list, length 1. Extract that one item.
            chosen = random.choices(population, weights=item_weights)[0]
            #end='' prevents a newline from being printed after each letter
            #flush=True forces output to be displayed right away, not buffered
            print(chosen, end='', flush=True)
            #If needed to accommodate the new item, clip the first item from the lead
            if len(lead) == order - 1:
                lead = lead[1:]
            #Tack on the new item
            lead += chosen
            generated_count += 1
        else:
            freq = sequencer[lead]
            weights = [freq.get(c, 0) for c in population]
            chosen = random.choices(population, weights=weights)[0]
            print(chosen, end='', flush=True)
            #Clip the first item from the lead and tack on the new item
            lead = lead[1:] + chosen
            generated_count += 1

    return


if __name__ == '__main__':
    #File with N‑gram frequencies is the first argument
    raw_freq_fp = open(sys.argv[1])
    length = int(sys.argv[2])
    raw_freqs = json.load(raw_freq_fp)

    #Figure out the N‑gram order. Just pull the first N‑gram and check its length
    order = len(next(iter(raw_freqs)))
    sequencer, item_freqs = preprocess_frequencies(raw_freqs, order)
    generate_letters(sequencer, item_freqs, length, order)

Run the program as follows:

python generate_letters_strategy1.py oanc-trigrams.json 500

The first command-line argument is the JSON file of the N-grams, and the second is the length of the text sequence to generate. You’ll get different output each time you do this, of course. The first time I ran it, I got the following output.

Sample output of generation from trigram frequencies

As you can see, it looks like a weird patchwork of English, with a few runs of actual words, such as “by the of” and “right thigh.”

Limitations of the strategy

Let’s see how the results improve by using higher-order N-grams. First, fifth order:

python generate_letters_strategy1.py oanc-5grams.json 500

The first time I got the following.

Sample output of generation from 5-gram frequencies

This is exciting. It’s now unmistakably English, with the occasional flourish in style, starting almost right away with “echoes so help meet the locations that a.”

Incidentally, you might notice this time that it’s slower to get going. This is because the higher the order of the N-grams, the more work needs to be done to initialize the transition probabilities. This effect is exponential with the order, so working with 7-grams will be much slower still. Luckily, the hardest bit only needs to be done once, so if you have code that generates text in long-running sessions, the delay will only be in launching the program in the first place.
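
One way to pay that preprocessing cost only once is to cache the computed structures on disk and reload them in later runs. The following is a rough sketch of that idea, not part of the listings in this tutorial; it assumes the preprocess_frequencies function from Listing 2 is importable and uses a hypothetical cache file name.

import json
import os
import pickle

#Sketch: cache the preprocessed structures so the expensive step runs once
#Assumes preprocess_frequencies from Listing 2 (generate_letters_strategy1.py) is importable
from generate_letters_strategy1 import preprocess_frequencies

CACHE = 'oanc-5grams.sequencer.pickle'   #Hypothetical cache file name

def load_sequencer(freq_file):
    #Reuse the preprocessed structures if a cache already exists
    if os.path.exists(CACHE):
        with open(CACHE, 'rb') as f:
            return pickle.load(f)
    with open(freq_file) as f:
        raw_freqs = json.load(f)
    order = len(next(iter(raw_freqs)))
    sequencer, item_freqs = preprocess_frequencies(raw_freqs, order)
    with open(CACHE, 'wb') as f:
        pickle.dump((sequencer, item_freqs, order), f)
    return sequencer, item_freqs, order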

Unfortunately, the excitement of English-like output was short-lived. On my third run, I got the following.

sample output of generation from 5-gram frequencies

This is even less coherent than the trigram-based example.

It became pretty clear that a large part of the problem is the initialization strategy. The first two letters, “T” and “N,” are each common enough to have a high chance in the initial selection, but this rare combination fails as a lead for finding further 5-grams, and the program never really finds an equilibrium of 5-grams to launch from.
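
You can probe this kind of dead end directly against the stored statistics. A quick sketch, assuming the oanc-5grams.json file from earlier (the example leads are just illustrations):

import json

with open('oanc-5grams.json') as f:
    frequencies = json.load(f)

def has_continuation(lead):
    #True if at least one stored 5-gram begins with this N-1 character lead
    return any(ngram.startswith(lead) for ngram in frequencies)

print(has_continuation('ther'))   #A common lead; almost certainly True
print(has_continuation('tnxq'))   #An improbable lead; likely a dead end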

A new initialization strategy

What if, instead of selecting random individual letters to initialize, we kept a list of, say, the 10,000 most common N-grams and always started with a random selection from among these? That would give the routine a strong push from the start every time.

The following code uses this new strategy to generate text. The main changes from the previous listing are in how the starting lead is chosen.

Listing 3. generate_letters_strategy2.py
import sys
import json
import string
import random

POPULAR_NGRAM_COUNT = 10000

#Population is all the possible items that can be generated
population = ' ' + string.ascii_lowercase

def preprocess_frequencies(frequencies, order):
    '''Compile simple mapping from N‑grams to frequencies into data structures to help compute
    the probability of state transitions to complete an N‑gram

    Arguments:
        frequencies ‑‑ mapping from N‑gram to frequency recorded in the training text
        order ‑‑ The N in each N‑gram (i.e. number of items)

    Returns:
        sequencer ‑‑ Set of mappings from each N‑1 sequence to the frequency of possible
            items completing it
        popular_ngrams ‑‑ list of most common N‑grams
    '''
    sequencer = {}
    ngrams_sorted_by_freq = [
        k for k in sorted(frequencies, key=frequencies.get, reverse=True)
    ]
    popular_ngrams = ngrams_sorted_by_freq[:POPULAR_NGRAM_COUNT]
    for ngram in frequencies:
        #Separate the N‑1 lead of each N‑gram from its item completions
        freq = frequencies[ngram]
        lead = ngram[:-1]
        final = ngram[-1]
        sequencer.setdefault(lead, {})
        sequencer[lead][final] = freq
    return sequencer, popular_ngrams


def generate_letters(sequencer, popular_ngrams, length, order):
    '''Generate text based on probabilities derived from statistics for initializing
    and continuing sequences of letters

    Arguments:
        sequencer ‑‑ mapping from each leading sequence to frequencies of the next letter
        popular_ngrams ‑‑ list of the highest frequency N‑Grams
        length ‑‑ approximate number of characters to generate before ending the program
        order ‑‑ The N in each N‑gram (i.e. number of items)

    Returns:
        nothing
    '''
    #The lead is the initial part of the N‑Gram to be completed, of length N‑1
    #containing the last N‑1 items produced
    lead = ''
    #Keep track of how many items have been generated
    generated_count = 0
    while generated_count < length:
        #This condition will be true until the initial lead N‑gram is constructed
        #It will also be true if we get to a dead end where there are no stats
        #For the next item from the current lead
        if lead not in sequencer:
            #Pick an N‑gram at random from the most popular
            reset = random.choice(popular_ngrams)
            #Drop the final item so that lead is N‑1
            lead = reset[:-1]
            for item in lead:
                print(item, end='', flush=True)
            generated_count += len(lead)
        else:
            freq = sequencer[lead]
            weights = [freq.get(c, 0) for c in population]
            chosen = random.choices(population, weights=weights)[0]
            print(chosen, end='', flush=True)
            #Clip the first item from the lead and tack on the new item
            lead = lead[1:] + chosen
            generated_count += 1

    return


if __name__ == '__main__':
    #File with N‑gram frequencies is the first argument
    raw_freq_fp = open(sys.argv[1])
    length = int(sys.argv[2])
    raw_freqs = json.load(raw_freq_fp)

    #Figure out the N‑gram order. Just pull the first N‑gram and check its length
    order = len(next(iter(raw_freqs)))
    sequencer, popular_ngrams = preprocess_frequencies(raw_freqs, order)
    generate_letters(sequencer, popular_ngrams, length, order)

This time I ran the program dozens of times and never had it deviate too far from English-looking output.

While we’re here, let’s have a look at what the 7-gram-based text generator produces. The following shows my first run.

Sample output of generation from 7-gram frequencies

It’s now starting to feel like something that should really be intelligible, but just not quite coherent. The ending, “left the documentation…” feels like some sort of technical text in an unfamiliar field.

Just to give a sense of the many other things you can tweak, you might consider introducing a small random chance that the program will output a letter that is not based on one of the N-grams in the training corpus, as in the sketch below. There are many ways to subtly tweak this basic text-generation program to change its behaviors and characteristics.
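
Here is a rough sketch of that particular tweak, expressed as a stand-alone selection function rather than a change to Listing 3; WILDCARD_CHANCE is a made-up parameter you could tune.

import random
import string

population = ' ' + string.ascii_lowercase
WILDCARD_CHANCE = 0.02   #Made-up parameter: 2% of letters ignore the statistics

def choose_next(freq):
    '''freq maps each candidate letter to its count for the current lead'''
    if random.random() < WILDCARD_CHANCE:
        #Occasionally emit any letter at all, regardless of the training corpus
        return random.choice(population)
    weights = [freq.get(c, 0) for c in population]
    return random.choices(population, weights=weights)[0]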

Preparing to work with words

As you saw from the 7-gram output example, this text generation based on letter correlation approaches intelligibility, but there’s a subtle linguistic barrier that’s really hard to probe until you start working at the granularity of words. In Part 2, I showed how to gather word-correlation statistics. The following code stores them in preparation for word generation.

Listing 4. store_word_ngram_counts.py
import sys
import json
from nltk.probability import FreqDist

from nltk.util import ngrams
from nltk.tokenize import RegexpTokenizer

#Set up a tokenizer that only captures words
#Requires that input has been preprocessed to lowercase all letters
TOKENIZER = RegexpTokenizer("[a-z]+")


def count_ngrams(input_fp, frequencies, order, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each N-gram (sequence of N words) appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each N-gram to its counted frequency
        order -- The N in each N-gram (i.e. number of items)
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    #Loop over the file while there is text to read
    while text:
        #This step is needed to collapse runs of space characters into one
        text = ' '.join(text.split())
        spans = TOKENIZER.span_tokenize(text)
        tokens = (text[begin : end] for (begin, end) in spans)
        for ngram in ngrams(tokens, order):
            #Join ngrams into a single space separated string
            ngram_text = ' '.join(ngram)
            #Extra layer to make sure no multiple runs of spaces sneak through
            ngram_text = ' '.join(ngram_text.split())
            frequencies[ngram_text] += 1
        #Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()

    return


if __name__ == '__main__':
    #Initialize the mapping
    frequencies = FreqDist()
    #The order of the ngrams is the first command‑line argument
    ngram_order = int(sys.argv[1])
    #Pull the input data from the console
    count_ngrams(sys.stdin, frequencies, ngram_order)
    output_fp = open(sys.argv[2], 'w')
    json.dump(dict(frequencies), output_fp)
    print('Stored frequencies of {} encountered N-grams.'.format(len(frequencies)))

Run the program as follows:

python store_word_ngram_counts.py 3 oanc-3word-grams.json < oanc.txt

I also generated 5-gram and 7-gram data sets to use later:

python store_word_ngram_counts.py 5 oanc-5word-grams.json < oanc.txt
python store_word_ngram_counts.py 7 oanc-7word-grams.json < oanc.txt

There are many more distinct N-grams when you’re working at the word level: 8,724,842 trigrams, 13,467,874 5-grams, and 13,897,093 7-grams.
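
Those counts are just the lengths of the stored mappings, so you can confirm the numbers for your own generated files with a few lines:

import json

for name in ('oanc-3word-grams.json', 'oanc-5word-grams.json', 'oanc-7word-grams.json'):
    with open(name) as f:
        print(name, len(json.load(f)))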

Generating text as N-grams of words

The following code is a program to take the stored statistics and use them to generate text.

Listing 5. generate_words.py
import sys
import json
import string
import random

POPULAR_NGRAM_COUNT = 10000


def preprocess_frequencies(frequencies, order):
    '''Compile simple mapping from N‑grams to frequencies into data structures to help compute
    the probability of state transitions to complete an N‑gram

    Arguments:
        frequencies ‑‑ mapping from N‑gram to frequency recorded in the training text
        order ‑‑ The N in each N‑gram (i.e. number of items)

    Returns:
        sequencer ‑‑ Set of mappings from each N‑1 sequence to the frequency of possible
            items completing it
        popular_ngrams ‑‑ list of most common N‑grams
        all_words ‑‑ list of all unique words that occur in the training text
    '''
    sequencer = {}
    ngrams_sorted_by_freq = [
        k for k in sorted(frequencies, key=frequencies.get, reverse=True)
    ]
    popular_ngrams = ngrams_sorted_by_freq[:POPULAR_NGRAM_COUNT]
    all_words = set()
    for ngram in frequencies:
        #Separate the N‑1 lead of each N‑gram from its item completions
        freq = frequencies[ngram]
        words = ngram.split()
        lead = words[:-1]
        final = words[-1]
        #A rule of Python means we must convert a list to a tuple before using it as
        #A mapping key
        sequencer.setdefault(tuple(lead), {})
        sequencer[tuple(lead)][final] = freq
        for word in words:
            all_words.add(word)
    #We used the set data structure to keep the words unique
    #But the way we need to use it, we must convert to list
    all_words = list(all_words)
    return sequencer, popular_ngrams, all_words


def generate_letters(sequencer, popular_ngrams, all_words, length, order):
    '''Generate text based on probabilities derived from statistics for initializing
    and continuing sequences of words

    Arguments:
        sequencer -- mapping from each leading sequence to frequencies of the next word
        popular_ngrams -- list of the highest frequency N-grams
        all_words -- list of all unique words that occur in the training text
        length -- approximate number of words to generate before ending the program
        order -- The N in each N-gram (i.e. number of items)

    Returns:
        nothing
    '''
    #The lead is the initial part of the N‑Gram to be completed, of length N‑1
    #containing the last N‑1 items produced
    lead = []
    #Keep track of how many items have been generated
    generated_count = 0
    while generated_count < length:
        #This condition will be true until the initial lead N‑gram is constructed
        #It will also be true if we get to a dead end where there are no stats
        #For the next item from the current lead
        if tuple(lead) not in sequencer:
            #Pick an N‑gram at random from the most popular
            reset = random.choice(popular_ngrams)
            #Split it up into a list to serve as a lead
            reset = reset.split()
            #Drop the final item so that lead is N‑1
            lead = reset[:-1]
            for item in lead:
                #Note we now print a space between items, which are words
                print(item, end=' ', flush=True)
            generated_count += len(lead)
        else:
            freq = sequencer[tuple(lead)]
            weights = [freq.get(w, 0) for w in all_words]
            chosen = random.choices(all_words, weights=weights)[0]
            print(chosen, end=' ', flush=True)
            #Clip the first item from the lead and tack on the new item
            lead = lead[1:] + [chosen]
            generated_count += 1

    return


if __name__ == '__main__':
    #File with N‑gram frequencies is the first argument
    raw_freq_fp = open(sys.argv[1])
    length = int(sys.argv[2])
    raw_freqs = json.load(raw_freq_fp)

    #Figure out the N‑gram order. Just pull the first N‑gram and check how many words in it
    order = len(next(iter(raw_freqs)).split())
    sequencer, popular_ngrams, all_words = preprocess_frequencies(raw_freqs, order)
    generate_letters(sequencer, popular_ngrams, all_words, length, order)

Run the program as follows:

python generate_words.py oanc-3word-grams.json 100

The second command-line argument is still the length of the text sequence to generate, but this time counted in words, not letters. The results are very interesting, of course. From the trigrams:

sample output of generation from trigram word frequencies

From 5-grams:

sample output of generation from 5-gram word frequencies

From 7-grams:

sample output of generation from 7-gram word frequencies

I’ll just leave you to marvel, as I did, at the final couple of lines of the 5-gram generated text and all of the 7-gram text.

Conclusion

I must stress again just how easy it is to make small changes in the workings of these programs to get very interesting variations. It’s an area where experimentation and exploration in code pay off greatly.

You have learned how to get AI to do more than just classify things, including generating patterns, and in particular, language based on model text. You can use such N-gram generation techniques to produce all sorts of patterns, including images and sound. After you start to play around with generative AI techniques, you will find myriad possibilities opening up for taking your work in AI to the next level.