Debugging TensorFlow programs

There are well-established techniques for debugging programs in common languages such as Python, Java, and C:

  • The easiest and quickest way is to add a few print statements in your code to inspect the value of some variables.

  • You can add code to check for error conditions, for instance using assert() to test for a value that a variable should never have.

  • For more complex bugs, the next step is to use an interactive debugger like pdb (Python) or gdb (C) that lets you stop at a breakpoint or when an exception is caught. You get access to the internal program environment and you can inspect the variables, traverse the stack frames, and step through some code to observe the behavior of your program.

  • For long-running processes, you can log messages to capture the values of variables during execution.

  • At a higher level of abstraction, you can save arbitrary data for post-mortem analysis using advanced tools.
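As a point of reference, here is a minimal sketch of the first techniques in ordinary Python, using only the standard library. The function name mean and the logger name "demo" are chosen for illustration only.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("demo")


def mean(values):
    # Assertion: guard against a state that should never occur.
    assert len(values) > 0, "values must not be empty"
    total = sum(values)
    # Print statement: a quick look at an intermediate value.
    print("total = %s" % total)
    # Log message: a lasting record, useful for long-running processes.
    log.info("computing the mean of %d values", len(values))
    # To pause here in the interactive debugger instead, you could add:
    # import pdb; pdb.set_trace()
    return total / len(values)


result = mean([1.0, 2.0, 3.0])
print(result)
```

Each of these works because the value exists at the moment the line runs, which is exactly the property that a TensorFlow graph program lacks.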

In this tutorial, we will demonstrate how to apply the familiar debugging techniques mentioned above to a TensorFlow program. The reader should have a basic understanding of TensorFlow, deep learning, and debugging practices. To try out the code examples, you should have TensorFlow installed on your workstation. The tutorial should take about 30 minutes to walk through.


In typical procedural programs, the lines of code are executed in the order in which they appear, and you can step through them to observe their effect, so it is relatively straightforward to identify a bug. Multi-threaded programs add some complication, but the general approach is the same.

TensorFlow programs, however, are different. They implement a neural network, so by nature they consist of graphs. A program has two distinct steps: (1) constructing a data flow graph to represent the neural network, and (2) executing the graph by performing computation as data is fed through the input. The graph is executed by the TensorFlow runtime, either locally or remotely on a different server or cluster. Because of this two-step flow, the usual debugging techniques above do not work directly on a TensorFlow program. For example, if you add a print statement to show a tensor, it displays only the description of the node in the graph being constructed, not its values. The same is true if you stop the program in a normal debugger and print a tensor. Let’s try this in Python by running the interpreter:

$ python
Python 2.7.10 (default, Jul 15 2017, 17:16:57)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> a = tf.random_uniform([4,4])
>>> a
<tf.Tensor 'random_uniform:0' shape=(4, 4) dtype=float32>

The reason is that the actual values are not computed, and therefore not available, until the graph is executed, and this execution happens in the TensorFlow runtime rather than in your Python code.
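To make the two-step flow concrete, here is a toy sketch of deferred evaluation in plain Python. It is not how TensorFlow is implemented; the Node class and its run() method are hypothetical names used only to show why printing a graph node shows a description rather than a value.

```python
class Node(object):
    """A toy graph node: it records an operation but computes nothing yet."""

    def __init__(self, name, fn, inputs=()):
        self.name = name
        self.fn = fn
        self.inputs = inputs

    def __repr__(self):
        return "<Node '%s'>" % self.name

    def run(self):
        # Values come into existence only here, when the graph is executed.
        return self.fn(*[node.run() for node in self.inputs])


# Step 1: construct the graph. No arithmetic happens yet.
two = Node("two", lambda: 2)
three = Node("three", lambda: 3)
product = Node("product", lambda x, y: x * y, (two, three))

print(product)        # shows the node description, not a value

# Step 2: execute the graph.
print(product.run())  # shows the computed value
```

A print statement placed between the two steps can only ever see the description, because the multiplication has not happened yet.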

Fortunately, debugging techniques have been developed for TensorFlow programs that to a large degree mirror the techniques above. They require some specific setup, but they let you debug in the familiar way. Since Python has the most complete support in TensorFlow, this tutorial focuses on Python-based techniques, although some of them work in other languages as well.

TensorFlow debugging techniques

Printing values

To see the values of a tensor, the print must happen during graph execution, so the technique is to insert a special node in the graph that is effectively a no-op but has the side effect of printing the values of some tensors. This special node is the operation tf.Print. You can insert it as follows:

a = tf.random_uniform([4,4])
b = tf.ones([4,4])
a = tf.Print(a, [a, b], message="Values for a and b: ", summarize=6)

When you execute a graph that contains this node, you will see the first six values of each tensor (as specified by the summarize argument):

2017-12-01 16:46:35.425191: I tensorflow/core/kernels/]
Values for a and b:
[[0.726375699 0.495358586 0.64934361 0.650732636][0.953052521 0.931101799]...][[1 1 1 1][1 1]...]

You can use any tensor as the target for tf.Print(). Note that tf.Print() is an identity operation: it returns the same tensor it is given, which is why the code above passes in a tensor and assigns the result back to the same variable. When the graph is executed, the node is evaluated and the tensor values are printed to the console. If you have multiple graphs, make sure to insert the node in the graph that you will be executing. If you run in distributed mode, the usual method for distributing the workload is data parallelism, so the tensors will be printed on each server's console and will reflect the values computed on that particular server.
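The assign-back idiom can be illustrated with a toy sketch in plain Python. This is not TensorFlow code: constant, print_op, and add are hypothetical helpers that model deferred values as functions. The sketch assumes the usual graph behavior that a node runs only when the result being computed depends on it.

```python
printed = []  # collects the messages so the effect is easy to inspect


def constant(value):
    # A deferred value: calling the returned function "executes" it.
    return lambda: value


def print_op(node, message):
    # Identity with a side effect: passes the value through unchanged
    # and reports it when, and only when, it is executed.
    def run():
        value = node()
        printed.append("%s%r" % (message, value))
        print(printed[-1])
        return value
    return run


def add(x, y):
    return lambda: x() + y()


a = constant(2)
b = constant(3)
unused = print_op(b, "b = ")  # never executed: nothing downstream consumes it
a = print_op(a, "a = ")       # reassigned, so later operations use the printing node
total = add(a, b)
result = total()
```

Only the message for a appears, because the printing node wrapped around b is not on the path to total. This is why the result of the print operation should replace the original tensor in the rest of the graph.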

Asserting and checking values

The TensorFlow library offers the usual assert function, tf.Assert, as well as several functions that check for values common in numerical computation, such as tf.check_numerics, tf.is_nan, and tf.is_inf.
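The idea behind an assert that carries data can be sketched in plain Python. The check helper below is hypothetical and is not a TensorFlow API; it only illustrates the pattern of testing a condition and reporting the offending values when it fails.

```python
def check(condition, data, message):
    # Raise with the offending data attached if the condition does not hold.
    if not condition:
        raise AssertionError("%s: %r" % (message, data))


probabilities = [0.7, 0.2, 0.1]

# Conditions that should always hold for a probability distribution.
check(all(p >= 0.0 for p in probabilities), probabilities,
      "negative probability")
check(abs(sum(probabilities) - 1.0) < 1e-6, probabilities,
      "probabilities do not sum to 1")

# A failing check reports the data that violated the condition.
try:
    check(False, [1], "x")
    failure = None
except AssertionError as error:
    failure = str(error)
print(failure)
```

In a graph program the same kind of check has to be a node in the graph, so that it is evaluated against real values during execution rather than at construction time.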

The interactive debugger

For more general debugging, where you want to look around in the execution environment for potential problems, tfdbg is an interactive debugger that provides access to the environment where the graph is executed. It works through a wrapper around the call to the runtime that captures the point of entry. For this reason, you need to make a small change to your program to add the tfdbg wrapper:

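import tensorflow as tf
from tensorflow.python import debug as tf_debug

sess = tf.InteractiveSession()
# Wrap the session so that calls to run() open the tfdbg command-line interface.
sess = tf_debug.LocalCLIDebugWrapperSession(sess)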

When a call is made to execute a graph, an interactive shell pops up to allow you to inspect the tensors and operations.


You can print the current values of a tensor, its shape, and its storage size. If your program is running out of memory, it is helpful to check the storage sizes to pinpoint the problem. Tensors tend to have many dimensions, and their shapes are often derived from the input, so their sizes can grow in unexpected ways and cause memory problems.
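A quick back-of-the-envelope calculation shows how fast storage grows with shape. The sketch below assumes 4-byte float32 elements and a hypothetical activation of shape [batch, 28, 28, 32]; the tensor_bytes helper is for illustration only.

```python
def tensor_bytes(shape, bytes_per_element=4):
    # Storage is the product of all dimensions times the element size.
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


small = tensor_bytes([64, 28, 28, 32])    # batch of 64
large = tensor_bytes([4096, 28, 28, 32])  # batch of 4096

print("%.1f MiB" % (small / 2.0 ** 20))
print("%.1f MiB" % (large / 2.0 ** 20))
```

Because the batch dimension multiplies every other dimension, a batch that is 64 times larger needs 64 times the storage for this one tensor alone.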


Navigating large multidimensional tensors interactively is cumbersome. To help with this, you can register filters, which are functions that scan the tensors for particular conditions such as infinity or NaN values. For managing control flow, such as stepping or continuing, tfdbg currently provides only primitive support; future versions may add more. For more details, please visit the TensorFlow tfdbg page.
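Conceptually, a filter is a predicate that is applied to every tensor so that you do not have to inspect them one by one. The following plain-Python sketch models tensors as nested lists. The filters registry and the tensor names are hypothetical, and this is not the tfdbg API itself.

```python
import math


def flatten(data):
    # Yield every scalar in an arbitrarily nested list.
    for item in data:
        if isinstance(item, list):
            for scalar in flatten(item):
                yield scalar
        else:
            yield item


def has_inf_or_nan(data):
    # True if any element is NaN or infinite.
    return any(math.isnan(x) or math.isinf(x) for x in flatten(data))


# A registry of named filters, looked up by name when scanning.
filters = {"has_inf_or_nan": has_inf_or_nan}

tensors = {
    "weights": [[0.1, 0.2], [0.3, 0.4]],
    "logits": [[1.0, float("inf")], [0.0, 2.0]],
}

flagged = sorted(name for name, data in tensors.items()
                 if filters["has_inf_or_nan"](data))
print(flagged)
```

The debugger can then stop at the first tensor that a filter flags, which narrows the search to the operation that produced the bad value.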

Dynamic graph

Interactive debugging is also possible through a recently announced feature, eager execution, which was motivated by PyTorch’s dynamic graph. TensorFlow’s graph is normally static; in other words, the graph must be fully created before it can be executed. In eager execution mode, the graph is constructed dynamically and evaluated immediately. The intention is to enable easy experimentation with your neural network, but a side benefit is that you can now use your Python interpreter to debug in the same way as a normal Python program. Since this feature is not included in release 1.4, to use eager execution mode you need to install TensorFlow from the master repository. You can clone and build from the latest master, or you can install directly from the nightly build, as follows:

sudo pip install tf-nightly

It may be a good practice to install the nightly build in a virtualenv to isolate this non-official version of TensorFlow, until the feature is included in an official release. Let’s try this out in a Python interpreter:

$ python
Python 2.7.10 (default, Jul 15 2017, 17:16:57)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> import tensorflow.contrib.eager as tfe
>>> tfe.enable_eager_execution()
>>> a = tf.random_uniform([4,4])
>>> a
<tf.Tensor: id=6, shape=(4, 4), dtype=float32, numpy=
array([[  8.63772631e-02,   1.43377185e-01,   4.90468979e-01,   9.01059031e-01],
       [  6.33775592e-01,   4.07707214e-01,   3.92198563e-04,   9.46395278e-01],
       [  9.71757889e-01,   9.00770664e-01,   3.30629349e-01,   1.42947078e-01],
       [  4.40206051e-01,   7.27719307e-01,   9.77952838e-01,   9.75356817e-01]],

You can see that, in addition to the usual metadata for the tensor, the values of the tensor are now computed and available immediately, without having to execute the full graph. Note that static and dynamic graphs are mutually exclusive, so you have to make the choice at the beginning of the program. The default mode is static graph. Since eager execution is in alpha release, many key TensorFlow features, such as distributed mode and TensorBoard, do not yet support it. As a result, it is currently useful mainly for debugging portions of your graph that involve only numerical computation. For more details, you can visit the TensorFlow Eager Execution page.


Logging

Logging is useful for debugging long-running training sessions or processes serving inference requests. TensorFlow supports the usual logging mechanism, with five levels in order of increasing severity as follows:

  • DEBUG
  • INFO
  • WARN
  • ERROR
  • FATAL

Note that the logs are generated from the graph execution, which occurs in the runtime. Setting a particular log level will show all messages from that level and all levels more severe. You can set the log level in the program by calling tf.logging.set_verbosity() with the desired level, for example tf.logging.INFO.


Since the runtime is implemented in C++, you can also set the C++ environment variables TF_CPP_MIN_LOG_LEVEL and TF_CPP_MIN_VLOG_LEVEL.


For the environment variables, the default value is 0, so all logs are shown. Set TF_CPP_MIN_LOG_LEVEL to 1 to filter out INFO logs and below, 2 to also filter out WARN, and 3 to also filter out ERROR. If TF_CPP_MIN_LOG_LEVEL is set, then TF_CPP_MIN_VLOG_LEVEL is ignored.
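The threshold rule can be sketched in a few lines of plain Python. The visible_levels helper is hypothetical and only models the filtering described above. Setting the variable through os.environ is one possible way to do it from Python, and it is an assumption here that this happens before TensorFlow is imported.

```python
import os

# Severities of the C++ runtime logs, indexed by their numeric level.
LEVELS = ["INFO", "WARN", "ERROR", "FATAL"]


def visible_levels(min_log_level):
    # Messages below the minimum level are filtered out.
    return LEVELS[min_log_level:]


# Assumption: set this before TensorFlow is imported so the runtime sees it.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

shown = visible_levels(int(os.environ["TF_CPP_MIN_LOG_LEVEL"]))
print(shown)
```

With the value 2, only ERROR and FATAL messages from the runtime remain visible.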

There are also API calls to inject your own log messages from your program at the desired level, such as tf.logging.info() and tf.logging.warn().

For more details, please visit the TensorFlow logging page.


TensorBoard

When your program seems to run correctly but does not produce the expected result, you need to debug at a higher level, and TensorBoard can be useful for this purpose. TensorBoard is a visualization tool for post-mortem analysis: you add calls in your program to generate data and write it to an event file. First, create the writer for the event file:

writer = tf.summary.FileWriter('./tensorflow_logs/mnist_deep', sess.graph)

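As you generate data, you can push it to the file:

merged = tf.summary.merge_all()  # build once, before the training loop
summary = sess.run(merged)       # in the loop; pass your feed_dict if the graph uses placeholders
writer.add_summary(summary, i)   # i is the current training step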

And remember to close the file handle before exiting your program by calling writer.close().


Please refer to the module tf.summary for the complete API for working with TensorBoard data. After your program has completed, you can run TensorBoard against this data:

tensorboard --logdir=./tensorflow_logs

TensorBoard runs as a web server, so you can access it in a browser using the link provided. The API supports simple scalar plots and histograms of any tensor, for example:

tf.summary.scalar('loss', cross_entropy)
tf.summary.histogram('softmax', y)



The API also supports audio and image data, allowing you to verify the input for training or the transformed data within the neural network:

tf.summary.image('input', x_image, 4)

For instance, you can display the images produced by a convolution layer.


Viewing the graph that implements your neural network is useful for spotting errors in the implementation. To make the graph more readable, add names to your tensors and operations:

x = tf.placeholder(tf.float32, [None, 784], name='x')
with tf.name_scope('optimizer'):
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy, name='train_step')

These names are used as labels when TensorBoard displays the graph of your neural network.


There is also support for advanced visualization of the clustering behavior in your tensors. For more details, please visit the TensorBoard page.


In this tutorial, we have seen how TensorFlow programs are different from your typical procedural programs and why debugging them requires some adjustment. It is helpful to be able to reuse the debugging practices we are familiar with as well as a few newer techniques.

Happy debugging!