Writing clean, testable, high quality code in Python


Writing software is among the most complicated endeavors a human can undertake. Brian Kernighan, co-author of the AWK programming language and “K and R C”, summed up the true nature of software development in the book Software Tools when he stated, “Controlling complexity is the essence of computer programming.” The harsh reality of real-world software development is that software is often created with intentional, or unintentional, complexity and a disregard for maintainability, testability, and quality. The end result is software that becomes increasingly difficult and expensive to maintain and that fails sporadically, and sometimes spectacularly.

The first step in the process of writing high quality code is to re-examine the entire thought process of how an individual or team develops software. Often in failed, or troubled, software development projects, the software was developed in a reactionary stream of consciousness where the focus of the software development was on getting a problem solved in any manner possible. In a successful software project, the developer is thinking not only about how to solve the problem at hand, but additionally about the process involved in solving the problem.

A successful software developer will devise a way to run the tests in an easily automated fashion, so they can continuously prove the software works. They are aware of the dangers of needless complexity. They are humble in their approach, seek critical review, and expect refactoring at every step of the way. They continuously think about how they can ensure their software is testable, readable, and maintainable. Although Python the language, and Python the community, are heavily influenced by a desire to write clean, maintainable code that works, it is still quite easy to do the exact opposite. In this article, we will tackle this problem head on and explore how to write clean, testable, high quality code in Python.

A clean code hypothetical problem

The best way to demonstrate this style of development is to solve a hypothetical problem. Let’s suppose you are a back-end web developer at a company that lets users write reviews, and you need a way to show and highlight small snippets of those reviews. One approach would be to write a large function that takes a snippet of text and query parameters and returns a character-limited snippet with the query parameters highlighted. All of the logic needed to solve the problem would live in that one “mega” function, and you would simply keep rerunning your script until you got the result you wanted. The code would probably look like the example below, and it would typically be developed with a combination of print statements, or logging statements, and an interactive shell.

Listing 1. Messy code
def my_mega_function(snippet, query):
    """This takes a snippet of text, and a query parameter and returns
    a character-limited snippet with the query terms highlighted."""

    #Logic goes here, and often runs on for several hundred lines
    #There are often deeply nested conditional statements and loops
    #Function could reach several hundred, if not thousands of lines
    result = None  #placeholder; the real logic would build this value
    return result

With a dynamic language like Python, Perl, or Ruby, it is easy to develop software by simply banging away at the problem, often interactively, until you get what seems to be the correct result and calling it a day. Unfortunately, this approach, while tempting, often leads to a false sense of accomplishment that is fraught with danger. Much of the danger lies in not designing a solution to be testable, and part lies in not properly controlling the complexity of the software written.

How can you say this function even works? You can have faith that it works because it worked the last time you ran it during development, but are you sure it doesn’t contain subtle errors of logic or syntax? What happens if you need to change the code? Would it still work, and how would you know it still worked? What if that code needed to be maintained by another developer, and he needed to make changes to it? How would he know his changes didn’t cause something subtle to break? How hard would it be for him to understand what the code does?

The short answer is: if you don’t have tests, you don’t know if your software works. If you stack together enough guesses, you may eventually build something that appears to function, but nobody could say with certainty that it ever worked properly. This is a bad place to be, and I have both written software this way and helped debug software written this way. Fortunately, this condition is easily avoidable. Writing tests before you write your logic, as in Test Driven Development, or while you write it, actually shapes the way the code is written. It leads to modular, extensible code that is easy to test, understand, and maintain. It is immediately apparent to the experienced developer when software was developed with testing in mind, and when it was not. The software itself looks dramatically different to the trained eye.
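To make the contrast with the “mega” function concrete, here is a minimal sketch of what test-shaped code tends to look like: small functions that each do one thing, paired with tests that pin down their behavior. The names split_query, wrap_term, and TestSmallFunctions are hypothetical and are not part of the article’s source code.

```python
import unittest


def split_query(query, separator="+"):
    """Split a query string such as 'deep+dish' into a list of terms."""
    return [term for term in query.split(separator) if term]


def wrap_term(term, start="<strong>", end="</strong>"):
    """Surround a single term with an opening and closing tag."""
    return "{0}{1}{2}".format(start, term, end)


class TestSmallFunctions(unittest.TestCase):
    """Each test exercises exactly one small function."""

    def test_split_query_ignores_empty_terms(self):
        self.assertEqual(split_query("deep++dish"), ["deep", "dish"])

    def test_wrap_term_uses_custom_tags(self):
        self.assertEqual(wrap_term("foo", "[", "]"), "[foo]")
```

Because neither function depends on hidden state, each one can be verified in a line or two, which is much harder to arrange once the same logic is buried inside several hundred lines of nested conditionals.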

You do not have to take my word for it, or rely on visual inspection: the difference between these two styles can be measured. The first way is to measure the lines of code that are exercised by tests. Nose is a popular extension of Python’s unittest framework that makes it easy to run a batch of tests automatically and supports plug-ins, such as code coverage. By measuring code coverage during development, it quickly becomes apparent that it is almost impossible to get 100 percent test coverage for code that is composed of large functions, with highly nested logic, built in an ad hoc manner.
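It can help to see what “lines that are tested” means mechanically. The following is only a rough sketch built on the standard library’s sys.settrace hook; it is not how nose or its coverage plug-in is implemented, and the names executed_lines and classify are hypothetical.

```python
import sys


def executed_lines(func, *args):
    """Return the line offsets (relative to the def line) that ran in func."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Only follow frames that belong to the function under test.
        if frame.f_code is code:
            if event == "line":
                hit.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        # Restore whatever tracer (if any) was active before.
        sys.settrace(previous)
    return hit


# Offsets: 1 is the `if`, 2 the "negative" return, 3 the final return.
def classify(number):
    if number < 0:
        return "negative"
    return "non-negative"


missed = {1, 2, 3} - executed_lines(classify, 5)
print(sorted(missed))  # offset 2: the "negative" branch never ran
```

A single call with a positive number leaves one branch unvisited; a second call with a negative number is needed before every line has run. Deeply nested logic multiplies the number of such calls required, which is why large ad hoc functions rarely reach full coverage.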

The second way to measure the difference is to use static analysis tools. There are several popular Python tools that measure various metrics for Python developers, ranging from general code quality to specific metrics, like duplicate code or complexity. You can measure the cyclomatic complexity of your code with either pygenie or pymetrics (see Resources).

Here is an example of what it looks like when we run pygenie on “clean” code that is relatively simple:

Listing 2. Pygenie output of cyclomatic complexity
% python pygenie.py complexity --verbose highlight.py
File: /Users/ngift/Documents/src/highlight.py
Type Name                                                        Complexity
M    HighlightDocumentOperations._create_snippit                 3
M    HighlightDocumentOperations._reconstruct_document_string    3
M    HighlightDocumentOperations._doc_to_sentences               2
M    HighlightDocumentOperations._querystring_to_dict            2
M    HighlightDocumentOperations._word_frequency_sort            2
M    HighlightDocumentOperations.highlight_doc                   2
X    /Users/ngift/Documents/src/highlight.py                     1
C    HighlightDocumentOperations                                 1
M    HighlightDocumentOperations.__init__                        1
M    HighlightDocumentOperations._custom_highlight_tag           1
M    HighlightDocumentOperations._score_sentences                1
M    HighlightDocumentOperations._multiple_string_replace        1

As you can tell from the example, every method is extremely simple and has a complexity rating under 10, which is desirable according to McCabe’s research. In my experience, I have seen “mega” functions written without testing that had complexity ratings over 140 and stretched over 1200 lines. Suffice it to say, code like this is practically impossible to test thoroughly: cyclomatic complexity counts the independent paths through a function, so a rating of 140 implies roughly that many paths to exercise. Without those tests there is no reliable way to know the function works, and refactoring it safely is out of reach. If the author of the code had kept testing in mind, and written the same logic with 100 percent test coverage, it is highly unlikely it would have such a high complexity rating.
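For readers curious about where such a number comes from, here is a rough approximation of McCabe’s metric using the standard library’s ast module: start at 1 and add one for every decision point. This is a sketch under simplifying assumptions, not the algorithm pygenie or pymetrics uses, and those tools may count some constructs differently. The name approximate_complexity and the two sample sources are hypothetical.

```python
import ast

# Node types treated as a single decision point in this sketch.
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approximate_complexity(source):
    """Return 1 plus the number of decision points found in source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths, one per extra operand.
            complexity += len(node.values) - 1
    return complexity


SIMPLE = "def add(a, b):\n    return a + b\n"

BRANCHY = '''
def pick(values, limit):
    for value in values:
        if value > limit and value % 2 == 0:
            return value
    return None
'''

print(approximate_complexity(SIMPLE))   # 1: a single straight-line path
print(approximate_complexity(BRANCHY))  # 4: for + if + one `and`, plus 1
```

Even this crude count makes the point: every loop, conditional, and boolean operator adds a path that a test has to cover.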

A clean code hypothetical solution

Let’s now take a look at a complete source code example with accompanying unit tests and functional tests and see what it actually does, and why this code is considered clean. One reasonable definition of clean, using strictly metrics, is that it fulfills the following requirements: it has close to 100 percent test coverage; it has a cyclomatic complexity rating of under 10 for all classes and methods; and it scores close to a 10.0 rating with pylint. Here is an example of using nose to measure unit test and doctest coverage on the highlight module:

Listing 3. Running nosetests with coverage reporting: 100 percent coverage
% nosetests -v --with-coverage --cover-package=highlight --with-doctest \
     --cover-erase --exe

Doctest: highlight.HighlightDocumentOperations._custom_highlight_tag ... ok
test_functional.test_snippit_algorithm ... ok
test_custom_highlight_tag (test_highlight.TestHighlight) ... ok
Consumes the generator, and then verifies the result[0] ... ok
Verifies highlighted text is what we expect ... ok
test_multi_string_replace (test_highlight.TestHighlight) ... ok
Verifies the yielded results are what is expected ... ok

Name        Stmts   Exec  Cover   Missing
highlight      71     71   100%   
Ran 7 tests in 4.223s


As you can see from the above snippet, the nosetests command was run with several options, and there was 100 percent test coverage for the highlight.py script. The only option of real note is --cover-package=highlight, which tells nose to show the coverage report only for the specified module. This is very useful for limiting the output of a coverage report to the modules or packages you want to observe. One thing you may want to try is to download the source code from this article and comment out some of the tests to see how the coverage reporting mechanism really works.
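The first line of the test output above is a doctest: an example embedded in a docstring that the test runner executes and compares against the recorded result. As a small illustration of the mechanism, the sketch below runs a docstring example directly with the standard library’s doctest module instead of going through nose. The functions tag and run_docstring_tests are hypothetical stand-ins and are not part of the article’s source code.

```python
import doctest


def tag(phrase, start="<strong>", end="</strong>"):
    """Wrap a phrase in an opening and closing tag.

    >>> tag("foo")
    '<strong>foo</strong>'
    >>> tag("foo", start="[", end="]")
    '[foo]'
    """
    return "{0}{1}{2}".format(start, phrase, end)


def run_docstring_tests(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # The globals dict gives the examples access to the function by name.
    test = parser.get_doctest(func.__doc__, {func.__name__: func},
                              func.__name__, "<sketch>", 0)
    results = doctest.DocTestRunner(verbose=False).run(test)
    return results.failed, results.attempted


print(run_docstring_tests(tag))  # (0, 2): two examples ran, none failed
```

A doctest doubles as documentation, which makes it a good fit for small helper methods whose behavior can be shown in a line or two.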

Listing 4. highlight.py
# -*- coding: utf-8 -*-
"""
:mod:`highlight` -- Highlight Methods

.. module:: highlight
   :platform: Unix, Windows
   :synopsis: highlight document snippets that match a query.
.. moduleauthor:: Noah Gift

    1.  You will need to install the nltk library to run this code.
    2.  You will need to download the data for the nltk:
        See http://www.nltk.org/data::

        import nltk
        nltk.download()
"""

import re
import logging

import nltk

LOG = logging.getLogger("highlight")

class HighlightDocumentOperations(object):

    """Highlight Operations for a Document"""

    def __init__(self, document=None, query=None):
        """
        Kwargs:
            document (str):
            query (str):
        """
        self._document = document
        self._query = query

    @staticmethod
    def _custom_highlight_tag(phrase, start="<strong>", end="</strong>"):
        """Injects an open and close highlight tag after a word

        Args:
            phrase (str) - A word or phrase.
        Kwargs:
            start (str) - An opening tag.  Defaults to <strong>
            end (str) - A closing tag.  Defaults to </strong>
        Returns:
            (str) word or phrase with custom opening and closing tags

        >>> h = HighlightDocumentOperations()
        >>> h._custom_highlight_tag("foo")
        '<strong>foo</strong>'
        """
        tagged_phrase = "{0}{1}{2}".format(start, phrase, end)
        return tagged_phrase

    def _doc_to_sentences(self):
        """Takes a string document and converts it into a list of sentences

        Unfortunately, this approach might be a tad naive for production
        because some segments that are split on a period are really an
        abbreviation, and to make things even more complicated, an
        abbreviation can also be the end of a sentence.

        Returns:
            (generator) A generator object of a tokenized sentence tuple,
            with the list position of sentence as the first portion of
            the tuple, such as:  (0, "This was the first sentence")
        """
        tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = tokenizer.tokenize(self._document)
        for sentence in enumerate(sentences):
            yield sentence

    @staticmethod
    def _score_sentences(sentence, querydict):
        """Creates a scoring system for each sentence by substitution analysis

        Tokenizes each sentence, counts characters
        in sentence, and pass it back as nested tuple

        Returns:
            (tuple) - (score (int), (count (int), position (int),
                   raw sentence (str))
        """
        position, sentence = sentence
        count = len(sentence)
        regex = re.compile('|'.join(map(re.escape, querydict)))
        score = len(re.findall(regex, sentence))
        processed_score = (score, (count, position, sentence))
        return processed_score

    def _querystring_to_dict(self, split_token="+"):
        """Converts query parameters into a dictionary

        Returns:
            (dict) - dparams, a dictionary of query parameters
        """
        params = self._query.split(split_token)
        dparams = dict([(key, self._custom_highlight_tag(key))
                        for key in params])
        return dparams

    @staticmethod
    def _word_frequency_sort(sentences):
        """Sorts sentences by score frequency, yields sorted result

        This will yield the highest score count items first.

        Args:
            sentences (list) - a nested tuple inside of list
            [(0, (90, 3, "The crust/dough was just way too effin' dry for me.
            Yes, I know what 'cornmeal' is, thanks."))]
        """
        #an ascending sort followed by pop() yields the highest score first
        sentences.sort()
        while sentences:
            yield sentences.pop()

    def _create_snippit(self, sentences, max_characters=175):
        """Creates a snippet from a sentence while keeping it under max_chars

        Returns a sorted list with max characters.  The sort is an attempt
        to rebuild the original document structure as close as possible,
        with the new sorting by scoring and the limitation of max_chars.

        Args:
            sentences (generator) - sorted object to turn into a snippit
            max_characters (int) - optional max characters of snippit

        Returns:
            snippit (list) - returns a sorted list with a nested tuple that
            has the first index holding the original position of the list::

            [(0, (90, 3, "The crust/dough was just way too effin' dry for me.
            Yes, I know what 'cornmeal' is, thanks."))]
        """
        snippit = []
        total = 0
        for sentence in self._word_frequency_sort(sentences):
            #pass the sentence as a lazy logging argument, not a stray one
            LOG.debug("Creating snippit: %s", sentence)
            score, (count, position, raw_sentence) = sentence
            total += count
            if total < max_characters:
                #position now gets converted to index 0 for sorting later
                snippit.append(((position), score, count, raw_sentence))
        #try to reassemble document by original order by doing a simple sort
        snippit.sort()
        return snippit

    @staticmethod
    def _multiple_string_replace(string_to_replace, dict_patterns):
        """Performs a multiple replace in a string with dict pattern.

        Borrowed from Python Cookbook.

        Args:
            string_to_replace (str) - String to be multi-replaced
            dict_patterns (dict) - A dict full of patterns

        Returns:
            (str) - Multiple replaced string.
        """
        regex = re.compile('|'.join(map(re.escape, dict_patterns)))

        def one_xlat(match):
            """Closure that is called repeatedly during multi-substitution.

            Args:
                match (SRE_Match object)
            Returns:
                partial string substitution (str)
            """
            return dict_patterns[match.group(0)]

        return regex.sub(one_xlat, string_to_replace)

    def _reconstruct_document_string(self, snippit, querydict):
        """Reconstructs string snippit, build tags, and return string

        A helper function for highlight_doc.

        Args:
            string_to_replace (list) - A list of nested tuples, containing
            this pattern::

            [(0, (90, 3, "The crust/dough was just way too effin' dry for me.
            Yes, I know what 'cornmeal' is, thanks."))]

            dict_patterns (dict) - A dict full of patterns

        Returns:
            (str) The most relevant snippet with the query terms highlighted.
        """
        snip = []
        for entry in snippit:
            score = entry[1]
            sent = entry[3]
            #if we have matches, now do the multi-replace
            if score:
                sent = self._multiple_string_replace(sent, querydict)
            snip.append(sent)
        highlighted_snip = " ".join(snip)
        return highlighted_snip

    def highlight_doc(self):
        """Finds the most relevant snippit with the query terms highlighted

        Returns:
            (str) The most relevant snippet with the query terms highlighted.
        """
        #tokenize to sentences, and convert query to a dict
        sentences = self._doc_to_sentences()
        querydict = self._querystring_to_dict()
        #process and score sentences
        scored_sentences = []
        for sentence in sentences:
            scored = self._score_sentences(sentence, querydict)
            scored_sentences.append(scored)
        #fit into max characters, and sort by original position
        snippit = self._create_snippit(scored_sentences)
        #assemble back into string
        highlighted_snip = self._reconstruct_document_string(snippit,
                                                             querydict)
        return highlighted_snip

Listing 5. test_highlight.py
# -*- coding: utf-8 -*-
"""
Tests this query searches a document, highlights a snippit and returns it

Contains both unit and functional tests.
"""

import unittest
from highlight import HighlightDocumentOperations

class TestHighlight(unittest.TestCase):

    def setUp(self):
        self.document = """
Review for their take-out only.
Tried their large Classic (sausage, mushroom, peppers and onions) deep dish;\
and their large Pesto Chicken thin crust pizzas.
Pizza = I've had better.  The crust/dough was just way too effin' dry for me.\
Yes, I know what 'cornmeal' is, thanks.  But it's way too dry.\
I'm not talking about the bottom of the pizza...I'm talking about the dough \
that's in between the sauce and bottom of the pie...it was like cardboard, sorry!
Wings = spicy and good.   Bleu cheese dressing only...hmmm, but no alternative\
of ranch dressing, at all.  Service = friendly enough at the counters.
Decor = freakin' dark.  I'm not sure how people can see their food.
Parking = a real pain.  Good luck.
        """
        self.query = "deep+dish+pizza"
        self.hdo = HighlightDocumentOperations(self.document, self.query)

    def test_custom_highlight_tag(self):
        actual = self.hdo._custom_highlight_tag("foo",
                                                start="[BAR]", end="[ENDBAR]")
        expected = "[BAR]foo[ENDBAR]"
        self.assertEqual(actual, expected)

    def test_query_string_to_dict(self):
        """Verifies the yielded results are what is expected"""
        result = self.hdo._querystring_to_dict()
        expected = {"deep": "<strong>deep</strong>",
                    "dish": "<strong>dish</strong>",
                    "pizza": "<strong>pizza</strong>"}
        self.assertEqual(result, expected)

    def test_multi_string_replace(self):
        query = """pizza = I've had better"""
        expected = """<strong>pizza</strong> = I've had better"""
        query_dict = self.hdo._querystring_to_dict()
        result = self.hdo._multiple_string_replace(query, query_dict)
        self.assertEqual(expected, result)

    def test_doc_to_sentences(self):
        """Consumes the generator, and then verifies the result[0]"""
        results = []
        expected = (0, '\nReview for their take-out only.')
        for sentence in self.hdo._doc_to_sentences():
            results.append(sentence)
        self.assertEqual(results[0], expected)

    def test_highlight(self):
        """Verifies highlighted text is what we expect"""
        expected = ("Tried their large Classic (sausage, mushroom, peppers "
                    "and onions) <strong>deep</strong> <strong>dish</strong>;"
                    "and their large Pesto Chicken thin crust "
                    "<strong>pizza</strong>s.")
        actual = self.hdo.highlight_doc()
        self.assertEqual(expected, actual)

    def tearDown(self):
        del self.query
        del self.hdo
        del self.document

if __name__ == '__main__':
    unittest.main()

Listing 6. testfunctionalhighlight.py
"""Functional Test That Performs Some Basic Sanity Checks"""

from highlight import HighlightDocumentOperations

def test_snippit_algorithm():
    document1 = """
        This place has awesome deep dish pizza.
        I have been getting delivery through Waiters on wheels for years.
        It is classic, deep dish  Chicago style pizza.
        Now I found out they also have half-baked to pick-up and cook at home.
        This is a great benefit. I am having it tonight. Yum.
        """
    document2 = """Review for their take-out only.
Tried their large Classic (sausage, mushroom, peppers and onions) deep dish;\
and their large Pesto Chicken thin crust pizzas.
Pizza = I've had better.  The crust/dough was just way too effin' dry for me.\
Yes, I know what 'cornmeal' is, thanks.  But it's way too dry.\
I'm not talking about the bottom of the pizza...I'm talking about the dough \
that's in between the sauce and bottom of the pie...it was like cardboard, sorry!
Wings = spicy and good.   Bleu cheese dressing only...hmmm, but no alternative\
of ranch dressing, at all.  Service = friendly enough at the counters.
Decor = freakin' dark.  I'm not sure how people can see their food.
Parking = a real pain.  Good luck."""
    h1 = HighlightDocumentOperations(document1, "deep+dish+pizza")
    actual = h1.highlight_doc()
    #print with a single formatted argument runs on Python 2 and Python 3
    print("Raw Document1: %s" % document1)
    print(" Formatted Document1: %s" % actual)
    assert len(actual) < 500
    assert "<strong>" in actual

    h2 = HighlightDocumentOperations(document2, "deep+dish+pizza")
    actual = h2.highlight_doc()
    print("Raw Document2: %s" % document2)
    print(" Formatted Document2: %s" % actual)
    assert len(actual) < 500
    assert "<strong>" in actual

if __name__ == "__main__":
    test_snippit_algorithm()

Concerning the above code sample, if you would like to run it, you will need to download the Natural Language Toolkit source and download the nltk data according to the instructions. Since this article is not about the code sample shown but about how it was created, and how to test it, I won’t go into any detail explaining what the code actually does. Instead, let’s finish up by running the static code analysis tool pylint on our source code:

Listing 7. Pylint
% pylint highlight.py
No config file found, using default configuration
************* Module highlight
E: 89:HighlightDocumentOperations._doc_to_sentences: Instance of 'unicode' has no 
    'tokenize' member (but some types could not be inferred)
E: 89:HighlightDocumentOperations._doc_to_sentences: Instance of 'ContextFreeGrammar' 
    has no 'tokenize' member (but some types could not be inferred)
W:108:HighlightDocumentOperations._score_sentences: Used builtin function 'map'
W:192:HighlightDocumentOperations._multiple_string_replace: Used builtin function 'map'
R: 34:HighlightDocumentOperations: Too few public methods (1/2)

69 statements analysed.

Global evaluation
Your code has been rated at 8.12/10 (previous run: 8.12/10)

The code scored 8.12 out of 10 and was marked down for a few items. Pylint is configurable, so you will likely need to tune it to meet the needs of your project; refer to the official pylint documentation (see Resources). For this specific example, there are two errors on line 89 that can be attributed to the external library nltk, and there are two warnings that could be silenced by a configuration change to pylint. In general, you will never want to allow pylint errors in your source code, but there are times, such as in the example above, when you may need to make an executive decision. It isn’t a perfect tool, but I have found it to be very useful in the real world.
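A configuration change is not the only way to deal with the two “Used builtin function 'map'” warnings; they can also be addressed in the code itself. The sketch below expresses the same pattern-building step with a generator expression instead of map. The helper name build_query_regex is hypothetical, and whether a given pylint release reports map at all depends on its version and configuration.

```python
import re


def build_query_regex(terms):
    """Compile one pattern that matches any of the given literal terms."""
    # Equivalent to '|'.join(map(re.escape, terms)), written without map().
    return re.compile("|".join(re.escape(term) for term in terms))


regex = build_query_regex(["deep", "dish", "pizza"])
print(len(regex.findall("deep dish pizzas")))  # 3 matches
```

Either spelling produces the same regular expression, so the choice between changing the code and changing the pylint configuration is a matter of team preference.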


In this article, we explored how merely thinking about testing influences the structure of software, and how a lack of thought toward testing can prove fatally harmful to a project. We showed a complete code example that included both functional and unit tests, and ran it through code coverage analysis with nose and two static analysis tools, pylint and pygenie. One thing we didn’t have time to cover was how to automate this with some form of continuous integration testing. Fortunately, this is quite simple with the open source Java™ Continuous Integration System, Hudson. I would encourage you to consult the Hudson documentation (see Resources) and experiment with setting up an automated job for your project that runs all of your tests, including static code analysis.

Finally, testing isn’t a panacea, nor are static analysis tools. Software development is hard work. To have even a chance of success, we have to always be mindful of the real goal. It is not only to solve a problem, but also to create something we can prove works. If you agree with this premise, then overly complex code, arrogance in design, and a lack of respect for the power of Python all directly interfere with this goal.

Thanks to Kennedy Behrman, of Imagemovers Digital, for the technical review of this article.