We are lucky enough to live at a time when market forces have pushed the price of memory, disk, and even CPU capacity to formerly inconceivable lows. At the same time, however, booming applications such as big data, AI, and cognitive computing are pushing our requirements for these resources upward at a dizzying rate. There is some irony that at a time when computing resources are plentiful, it’s becoming even more important for developers to understand how to scale down their consumption to remain competitive.

The main reason Python has remained such a popular programming language for almost two decades is that it is so easy to learn. Within an hour, you can be comfortably manipulating lists and dictionaries. The bad news is that the naive approach to solving many problems with lists and dictionaries can quickly get you into trouble when you try to scale your app, because without care, Python tends to be a bit more resource-hungry than other programming languages.

The good news is that Python has some useful features to facilitate more efficient processing. At the foundation of many of these features is Python’s iterator protocol, which is the main topic of this tutorial. The full series of four tutorials will build on this to show you how to process large data sets efficiently with Python.

You should be familiar with the basics of Python, such as conditions, loops, functions, exceptions, lists, and dictionaries. This tutorial series focuses on Python 3; to run all the code, you need Python 3.6 or a more recent version.

Iterators

Most likely, your earliest exposure to Python loops was code like the following:


for ix in range(10):
    print(ix)

Python’s for statement operates on what are called iterators. An iterator is an object that can be asked, over and over, for a next value, producing a series of values. If the value after the in keyword is not already an iterator, for tries to convert it to one. The object returned by the built-in range function is an example of something that can be converted to an iterator. It produces a series of numbers, and the for loop iterates over these, assigning each in turn to the variable ix.

It’s time to deepen your understanding of Python by taking a closer look at iterators like range. Enter the following in a Python interpreter:


r = range(10)

You have now initialized a range object, but that’s all. Go ahead and ask it for its first value. You ask an iterator for a value in Python by using the built-in next function.

>>> r = range(10)
>>> print(next(r))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'range' object is not an iterator

This exception indicates that you have to convert the object to an iterator before you can use it as one. You can do this using the built-in iter function.


r = iter(range(10))
print(next(r))

This time it prints 0, as you might expect. Go ahead and enter print(next(r)) again and it will print 1, and so on. Keep on entering this same line. At this point, you should be grateful that on most systems, you can just press the Up arrow on the Python interpreter to retrieve the most recent command, then press Enter to execute it again, or even tweak it before pressing Enter, if you like.

In this case, you’ll eventually get to something like the following:

>>> print(next(r))
9
>>> print(next(r))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

We only asked for a range of 10 integers, so we fall off the end of that range after it has produced 9. The iterator doesn’t immediately do anything to indicate it has come to an end, but any subsequent next() calls will raise the StopIteration exception. As with any exception, you can choose to write your own code to handle it. Try the following code after the iterator r has been used up.

try:
    print(next(r))
except StopIteration as e:
    print("That's all folks!")

It prints the message “That’s all folks!” The for statement uses the StopIteration exception to determine when to exit the loop.
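
In fact, a for loop behaves roughly like the following hand-written loop. This is only a sketch of the idea, not the interpreter’s actual implementation:

items = [1, 2, 3]
it = iter(items)           #for first converts the iterable to an iterator
while True:
    try:
        ix = next(it)      #ask the iterator for its next value
    except StopIteration:  #the loop ends quietly when this is raised
        break
    print(ix)              #this stands in for the body of the for loop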

Other iterables

A range is only one sort of object that can be converted to an iterator. The following interpreter session demonstrates how a variety of standard types can be converted to iterators.

>>> it = iter([1,2,3])
>>> print(next(it))
1
>>> it = iter((1,2,3))
>>> print(next(it))
1
>>> it = iter({1: 'a', 2: 'b', 3: 'c'})
>>> print(next(it))
1
>>> it = iter({'a': 1, 'b': 2, 'c': 3})
>>> print(next(it))
a
>>> it = iter(set((1,2,3)))
>>> print(next(it))
1
>>> it = iter('xyz')
>>> print(next(it))
x

It’s quite straightforward in the case of a list or tuple. A dictionary iterates over just its keys; in older versions of Python no key order was guaranteed, though as of Python 3.7 dictionaries preserve insertion order. Order of iteration isn’t guaranteed for sets, even though in this case, the first item from the iterator happened to be the first item in the tuple that was used to construct the set. A string iterates over its characters. All such objects are called iterables.
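
If you want a dictionary’s values or its key/value pairs rather than just its keys, the values and items methods give you iterables for those as well. The outputs shown here assume a Python version that preserves dictionary insertion order (3.7 or later):

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> it = iter(d.values())
>>> print(next(it))
1
>>> it = iter(d.items())
>>> print(next(it))
('a', 1)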

As you can imagine, not every Python object can be converted to an iterator.

>>> it = iter(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not iterable
>>> it = iter(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not iterable

The best part is, of course, that you can make your own iterator types. All you need to do is define a class with certain specially named methods. A full treatment of that is outside the scope of this tutorial series, but that’s OK, because the most straightforward way to create your own custom iterator is not a special class but a special kind of function called a generator function. I discuss this next.
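
For the curious, here is a minimal sketch of such a class, just one hypothetical way to implement the specially named __iter__ and __next__ methods; you won’t need this approach for the rest of the series:

class CountUpTo:
    '''Iterator that produces 0, 1, ... up to but not including limit'''
    def __init__(self, limit):
        self.limit = limit
        self.current = 0

    def __iter__(self):
        #An iterator is required to return itself from __iter__
        return self

    def __next__(self):
        if self.current >= self.limit:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

#for ix in CountUpTo(3): print(ix)  prints 0, 1, 2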

Generators

You’re used to the idea of a function, which takes some arguments and returns a value, or None. A function can have more than one possible exit point: multiple return statements, or simply the last indented line of the body, which is the same thing as return None. Each time the function runs, though, only one of these exit points is reached, based on the conditions in the function.

A generator function is a special type of function that interacts in a more complex, but useful way with the code that invokes it. Here is a simple example that you can paste into your interpreter session:

def gen123():
    yield 1
    yield 2
    yield 3

This is automatically a generator function because it contains at least one yield statement in its body. That one subtle distinction is the only thing that turns a regular function into a generator function, which is a bit sneaky, because there is a huge difference in behavior between regular functions and generator functions.

Call a generator function like any other function:


>>> it = gen123()
>>> print(it)
<generator object gen123 at 0x10ccccba0>

This function call returns right away, and not with a value specified in the function body. Calling a generator function always returns what is called a generator object. A generator object is an iterator that produces values from the yield statements in the generator function body. In standard terminology, a generator object yields a series of values. Let’s dig into the generator object from the previous code snippet.


>>> print(next(it))
1
>>> print(next(it))
2
>>> print(next(it))
3
>>> print(next(it))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Each time you call next() on the object, you get back the next yield value, until there are no more, in which case you get the StopIteration exception. Of course, because it is an iterator, you can use it in a for loop. Just remember to create a new generator object, because the first one has been exhausted.


>>> it = gen123()
>>> for ix in it:
...     print(ix)
... 
1
2
3
>>> for ix in gen123():
...     print(ix)
... 
1
2
3

Generator function arguments

Generator functions accept arguments, and these get passed into the body of the generator. Paste in the following generator function.


def gen123plus(x):
    yield x + 1
    yield x + 2
    yield x + 3

Now try it with different arguments, for example:


>>> for ix in gen123plus(10):
...     print(ix)
... 
11
12
13

When you are iterating over a generator object, the state of its function is suspended and resumed as you go, which introduces a new concept with Python functions. You can now in effect run code from multiple functions in a way that overlaps. Take the following session.


>>> it1 = gen123plus(10)
>>> it2 = gen123plus(20)
>>> print(next(it1))
11
>>> print(next(it2))
21
>>> print(next(it1))
12
>>> print(next(it1))
13
>>> print(next(it2))
22
>>> print(next(it1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> print(next(it2))
23
>>> print(next(it2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

I create two generator objects from the one generator function. I can then get the next item from one or the other object, and notice how each is suspended and resumed independently. They are independent in every way, including in how they fall into StopIteration.

Make sure that you study this session carefully until you really get what’s going on. Once you get it, you will truly have a basic grasp on generators and what makes them so powerful.

Note that you can use all the usual positional and keyword argument features as well.
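
For example, here is a variation on gen123plus, a name made up for this illustration, with a keyword argument that has a default value:

>>> def gen123step(x, step=1):
...     yield x + step
...     yield x + 2*step
...     yield x + 3*step
... 
>>> for ix in gen123step(10, step=5):
...     print(ix)
... 
15
20
25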

Local state in generator functions

You can do all the normal things with conditions, loops, and local variables in generator functions and build up to very sophisticated and specialized iterators.

Let’s have a bit of fun with the next example. We’re all tired of being controlled by the weather. Let’s create some weather of our own. Listing 1 is a weather simulator that prints a series of sunny or rainy days, with the occasional commentary.

If you think about the weather, a sunny day is often followed by another sunny day, and a rainy day is often followed by another rainy day. You can simulate this by randomly choosing the next day’s weather, but with a higher probability that the weather will stay the same. One word for weather that is very likely to change is volatile, so this generator function takes an argument, volatility, which should be between 0 and 1. The lower this argument, the greater the chance that the weather will stay the same from day to day. In this listing, volatility is set to 0.2, which means that on average 4 out of 5 transitions should keep the weather the same.
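
The weighting comes from the standard library’s random.choices function. If it’s new to you, here is a quick feel for how it behaves; because it is random, your exact output will differ:

>>> import random
>>> weights = 0.8, 0.2
>>> random.choices(['same', 'change'], weights)
['same']
>>> [random.choices(['same', 'change'], weights)[0] for _ in range(10)]
['same', 'same', 'same', 'change', 'same', 'same', 'same', 'same', 'change', 'same']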

The listing has the added feature that if there are three or more sunny days in a row, or three or more rainy days in a row, it posts a bit of commentary.

Listing 1. Weather simulator

import random

def weathermaker(volatility, days):
    '''
    Yield a series of messages giving the day's weather and occasional commentary

    volatility - a float between 0 and 1; the greater this number the greater
                    the likelihood that the weather will change on each given day
    days - number of days for which to generate weather
    '''
    #Always start as if yesterday were sunny
    current_weather = 'sunny'
    #First item is the probability that the weather will stay the same
    #Second item is the probability that the weather will change
    #The higher the volatility the greater the likelihood of change
    weights = 1.0 - volatility, volatility
    #For fun track how many sunny days in a row there have been
    sunny_run = 1
    #How many rainy days in a row there have been
    rainy_run = 0
    for day in range(days):
        #Figure out the opposite of the current weather
        other_weather = 'rainy' if current_weather == 'sunny' else 'sunny'
        #Set up to choose the next day's weather. First set up the choices
        choose_from = current_weather, other_weather
        #random.choices returns a list of random choices based on the weights
        #By default a list of 1 item, so we grab that first and only item with [0]
        current_weather = random.choices(choose_from, weights)[0]
        yield 'today it is ' + current_weather
        if current_weather == 'sunny':
            #Check for runs of three or more sunny days
            sunny_run += 1
            rainy_run = 0
            if sunny_run >= 3:
                yield "Uh oh! We're getting thirsty!"
        else:
            #Check for runs of three or more rainy days
            rainy_run += 1
            sunny_run = 0
            if rainy_run >= 3:
                yield "Rain, rain go away!"
    return

#Create a generator object and print its series of messages
for msg in weathermaker(0.2, 10):
    print(msg)

The weathermaker function uses many common programming features but also illustrates some interesting aspects of generators. The number of items yielded is not fixed. It can be as few as the number of days, or it could be more because of the commentary on runs of sunny or rainy days. These are yielded in different condition branches.

Run the listing and you should see something like:


$ python weathermaker.py
today it is sunny
today it is sunny
Uh oh! We're getting thirsty!
today it is sunny
Uh oh! We're getting thirsty!
today it is sunny
Uh oh! We're getting thirsty!
today it is rainy
today it is sunny
today it is rainy
today it is rainy
today it is rainy
Rain, rain go away!
today it is rainy
Rain, rain go away!

Of course, it’s based on randomness, and with a 4 out of 5 chance each day that the weather stays the same, you could instead get a run like this:


$ python weathermaker.py
today it is sunny
today it is sunny
Uh oh! We're getting thirsty!
today it is sunny
Uh oh! We're getting thirsty!
today it is sunny
Uh oh! We're getting thirsty!
today it is sunny
Uh oh! We're getting thirsty!
today it is sunny
Uh oh! We're getting thirsty!
today it is sunny
Uh oh! We're getting thirsty!
today it is sunny
Uh oh! We're getting thirsty!
today it is sunny
Uh oh! We're getting thirsty!
today it is sunny
Uh oh! We're getting thirsty!

Take some time to play around with this yourself, first of all passing different values in for volatility and days, and then tweaking the generator function code itself. Experimentation is the best way to be sure you really understand how the generator works.

I hope this more interesting example fuels your imagination with some of the power of generators. You could certainly write the previous code without generators, but not only is this approach more expressive and usually more efficient, you also have the advantage of being able to reuse the weathermaker generator in other interesting ways besides the simple loop at the bottom of the listing.
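
As one hypothetical example of such reuse, you could consume the same generator to count rainy days over a simulated year, without printing any daily messages at all:

#Count rainy days in a simulated year of weather
rainy_days = 0
for msg in weathermaker(0.2, 365):
    if msg == 'today it is rainy':
        rainy_days += 1
print(rainy_days)   #the exact count will vary from run to run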

Generator expressions

A common use of generators is to iterate over one iterator and manipulate it in some way, producing a modified iterator.

Let’s write a generator that takes an iterator and substitutes values found in a sequence according to a provided set of replacements.


def substituter(seq, substitutions):
    for item in seq:
        if item in substitutions:
            yield substitutions[item]
        else:
            yield item

In the following session, you can see an example of how to use this generator:

>>> s = 'hello world and everyone in the world'
>>> subs = {'hello': 'goodbye', 'world': 'galaxy'}
>>> for word in substituter(s.split(), subs):
...     print(word, end=' ')
... 
goodbye galaxy and everyone in the galaxy

Again, take some time to play around with this yourself, trying other loop manipulations until you understand clearly how the generator works.

This sort of manipulation is so common that Python provides a handy syntax for it, called a generator expression. Here is the previous session implemented using a generator expression.

>>> words = ( subs.get(item, item) for item in s.split() )
>>> for word in words:
...     print(word, end=' ')
... 
goodbye galaxy and everyone in the galaxy

In short, any time you have parentheses around a for expression, it is a generator expression. The resulting object, assigned to words in this case, is a generator object. Sometimes you end up using some of Python’s more interesting features to fit logic into such expressions. In this case, I take advantage of the get method on dictionaries, which looks up a key but allows me to specify a default to be returned if the key is not found. I ask for the substitution value of item, if found, and otherwise just item as is.
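
If the get method is new to you, here is how it behaves on its own; the second argument is the default returned when the key is missing:

>>> subs = {'hello': 'goodbye', 'world': 'galaxy'}
>>> subs.get('hello', 'hello')
'goodbye'
>>> subs.get('everyone', 'everyone')
'everyone'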

List comprehensions recap

You might be familiar with list comprehensions, which use a similar syntax, but with square brackets. The result of a list comprehension is a list:


>>> mylist = [ ix for ix in range(10, 20) ]
>>> print(mylist)
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

A generator expression’s syntax is similar, but it returns a generator object:


>>> mygen = ( ix for ix in range(10, 20) )
>>> print(mygen)
<generator object <genexpr> at 0x10ccccba0>
>>> print(next(mygen))
10

The main practical difference between these last two examples is that the list created in the first sits there from the moment it is created, taking up all the memory needed to store its values. The generator expression doesn’t need that storage; instead, its work is suspended and resumed whenever it is iterated over, as the body of a generator function would be. In effect, it allows you to get the data on demand, rather than having it all prestocked for you.
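
You can get a rough sense of the difference with sys.getsizeof, which reports an object’s own memory footprint in bytes. The exact numbers vary by Python version and platform, but the pattern holds:

import sys

biglist = [ix for ix in range(100000)]
biggen = (ix for ix in range(100000))
#The list's size grows with the number of items it holds; the generator object's
#size stays small no matter how many values it will eventually yield
print(sys.getsizeof(biglist))
print(sys.getsizeof(biggen))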

An off-the-cuff analogy is that your household might drink 200 gallons of milk each year, but you don’t want to have to build a storage facility in your basement for all that milk. Instead, you go to the store to buy a gallon at a time, as you need more milk. Using a generator instead of building lists all the time is a bit like using your grocery store rather than building yourself a warehouse.

There are also dictionary and set comprehensions, but these are outside the scope of this tutorial. Note that you can easily convert a generator expression into a list, and this can sometimes be a way of consuming a generator all at once, but it can also defeat the purpose of using a generator by creating memory-hungry lists if you’re not careful.


>>> mygen = ( ix for ix in range(10, 20) )
>>> print(list(mygen))
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

I’ll sometimes construct lists from generators in this tutorial series for quick demonstrations.

Filtering and chaining

You can use a simple condition in generator expressions to filter out items from the input iterator. The following example produces all numbers from 1 through 19 that are multiples of neither 2 nor 3. It uses the handy math.gcd function, which returns the greatest common divisor of two integers. If the GCD of a number and 2, for example, is 1, then that number is not a multiple of 2.


>>> import math
>>> notby2or3 = ( n for n in range(1, 20) if math.gcd(n, 2) == 1 and math.gcd(n, 3) == 1 )
>>> print(list(notby2or3))
[1, 5, 7, 11, 13, 17, 19]

You can see how the if clause sits right inline within the generator expression. Note that you can also nest for clauses in generator expressions, as in the following example. Then again, generator expressions are really just compact syntax for generator functions, so if you start needing really complex generator expressions, you might end up with more readable code just using generator functions.
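
For example, here is a generator expression with two nested for clauses; it yields a pair for every combination of a letter and a number, just as a generator function with one loop inside another would:

>>> pairs = ( (letter, num) for letter in 'abc' for num in range(2) )
>>> print(list(pairs))
[('a', 0), ('a', 1), ('b', 0), ('b', 1), ('c', 0), ('c', 1)]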

You can chain together generator objects, including generator expressions.


>>> notby2or3 = ( n for n in range(1, 20) if math.gcd(n, 2) == 1 and math.gcd(n, 3) == 1 )
>>> squarednotby2or3 = ( n*n for n in notby2or3 )
>>> print(list(squarednotby2or3))
[1, 25, 49, 121, 169, 289, 361]

Such patterns of chained generators are powerful and efficient. In the previous example, the first line defines a generator object, but none of its work is done. The second line defines a second generator object, which refers to the first one, but still none of the work for either object is done. It’s not until the full iteration is requested, in this case by the list constructor, that all the work is done. This idea of doing the work of iterating over things only as needed is called lazy evaluation, and it is one of the hallmarks of well-designed code using generators. When you use such chains of generators and generator expressions, however, remember that something has to actually trigger the iteration. In this case, it’s the list function. It could also be a for loop. It’s an easy mistake to set up all sorts of generators and then forget to trigger the iteration, in which case you end up scratching your head as to why your code is not doing anything.
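
You can see this laziness directly by putting a side effect, such as a print call, inside a generator function. The following is a made-up example; nothing happens when the generator object is created, and the work only runs when something iterates over it:

>>> def noisy_squares(seq):
...     for n in seq:
...         print('working on', n)
...         yield n*n
... 
>>> squares = noisy_squares(range(3))
>>> results = list(squares)
working on 0
working on 1
working on 2
>>> print(results)
[0, 1, 4]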

The value of laziness

In this tutorial, you’ve learned the basics of iterators and also the most interesting sources of iterators, generator functions, and expressions.

You could certainly write all the code in this tutorial without any generators, but learning to use generators opens up more flexible and efficient ways of thinking about operating on any concepts or data that develop in a series. To repeat my analogy from earlier, it makes more sense to get a gallon of milk from the store at a time, on demand, rather than building yourself a warehouse with a year’s supply. Though developers call the equivalent approach lazy evaluation, the laziness is more about the timing of when you obtain what you need. It probably doesn’t seem so lazy to make a trip to the grocery store every other day. Similarly, sometimes writing code to use generators can take a bit more work and can even be a bit mind-bending, but the benefits come with more scalable processing.

Learning iterators and generators is one important step in mastering Python, and another is learning the many amazing tools provided in the standard library to process iterators. That will be the topic of the next tutorial in this series.