Debugging TensorFlow programs
TensorFlow programs are different from typical programs. Learn techniques for debugging them in different scenarios.
There are well-established techniques for debugging programs in common languages such as Python, Java, and C:
The easiest and quickest way is to add a few print statements to your code to inspect the values of some variables.
You can add code to check for error conditions, for instance using assert() to test for a value that a variable should never have.
For more complex bugs, the next step is to use an interactive debugger like pdb (Python) or gdb (C) that lets you stop at a breakpoint or when an exception is caught. You get access to the internal program environment and you can inspect the variables, traverse the stack frames, and step through some code to observe the behavior of your program.
For long-running processes, you can log messages to capture the values of variables during execution.
At a higher level of abstraction, you can save arbitrary data for post-mortem analysis using advanced tools.
In this tutorial, we will demonstrate how to perform the familiar debugging techniques mentioned above in a TensorFlow program. The reader should have a basic understanding of TensorFlow, Deep Learning and debugging practices. To try out the code examples, you should have TensorFlow installed on your workstation. The tutorial should take about 30 minutes to walk through.
In typical procedural programs, the lines of code are executed in the order in which they appear, and you can step through the code to observe the effect, so it is relatively straightforward to identify a bug. Multi-threaded programs add some complication, but the general approach is the same.
TensorFlow programs, however, are different. TensorFlow programs implement a neural network, so by nature they consist of graphs. The programs have two distinct steps: (1) constructing a data flow graph to represent the neural network, and (2) executing the graph by performing computation as data is fed through the input. The graph execution either runs locally in a separate process or remotely on a different server or cluster. Because of this two-step flow, you cannot debug a TensorFlow program using the usual debugging techniques above. For example, if you add a print statement in the program to show a tensor, it will only display the data structure of the graph being constructed. The same is true if you stop the program in a normal debugger and print the value of a tensor. Let’s try this in Python by running the interpreter:
$ python
Python 2.7.10 (default, Jul 15 2017, 17:16:57)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> a = tf.random_uniform([4,4])
>>> a
<tf.Tensor 'random_uniform:0' shape=(4, 4) dtype=float32>
The reason is that the actual values are not computed, and do not become available, until the graph is executed, and this execution happens in a different process elsewhere.
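The two-step flow described above can be sketched as follows. This is a minimal illustration, assuming a TensorFlow installation that provides the tensorflow.compat.v1 module (on older 1.x releases, a plain import tensorflow as tf should expose the same calls). The helper name build_and_run is hypothetical.

```python
# Assumption: tensorflow.compat.v1 is available; on older 1.x releases,
# `import tensorflow as tf` should expose the same API.
import tensorflow.compat.v1 as tf


def build_and_run():
    """Build a tiny graph, then execute it (hypothetical helper)."""
    with tf.Graph().as_default():
        # Step 1: construction. This only adds a node to the graph.
        a = tf.random_uniform([4, 4])
        symbolic = repr(a)  # describes the node, holds no values

        # Step 2: execution. The session computes the actual values.
        with tf.Session() as sess:
            values = sess.run(a)  # a NumPy array
    return symbolic, values


symbolic, values = build_and_run()
print(symbolic)  # the graph node description
print(values)    # the computed 4x4 array
```

Printing the Python object shows only the node; the numbers exist only in what sess.run() returns.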
Fortunately, debugging techniques have been developed for TensorFlow programs that to a large degree mirror the techniques above. They do require some specific setup, but they allow you to debug using the familiar approach. Since Python has the most complete support in TensorFlow, this how-to document focuses on Python-based techniques, although some of them work in other languages as well.
TensorFlow debugging techniques
To be able to see the values of the tensor, we want the print statement to be performed during the graph execution, so the technique is to insert a special node in the graph that is really a no-op but has the side effect of printing the values of some tensors. This special node is the operation tf.Print. You can insert this node as follows:
a = tf.random_uniform([4,4])
b = tf.ones([4,4])
a = tf.Print(a, [a, b], message="Values for a and b: ", summarize=6)
Run this program segment and you will see the first six values of each tensor (as specified by the count for summarize):
2017-12-01 16:46:35.425191: I tensorflow/core/kernels/logging_ops.cc:79] Values for a and b: [[0.726375699 0.495358586 0.64934361 0.650732636][0.953052521 0.931101799]...][[1 1 1 1][1 1]...]
You can use any tensor as the target for tf.Print(). Note that tf.Print() is an identity operation, which means that it returns the same tensor, so the syntax shown above is to pass in a tensor and assign the result to the same tensor. When the graph is executed, this node will be evaluated and the tensor values will be printed to the console. If you have multiple graphs, make sure to insert the node in the graph that you will be executing. If you run in distributed mode, the usual method for distributing the workload is data parallelism, so the tensors will be printed on each server console and will reflect the values being computed on that particular server.
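Because tf.Print() is an identity node, it only prints when something that is actually executed depends on the tensor it returns. The sketch below shows one way to wire it in, assuming the tensorflow.compat.v1 module is available; the downstream tensor total is a made-up example and not part of the original snippet.

```python
# Assumption: tensorflow.compat.v1 is available; on older 1.x releases,
# `import tensorflow as tf` should expose the same API.
import tensorflow.compat.v1 as tf

with tf.Graph().as_default():
    a = tf.random_uniform([4, 4])
    b = tf.ones([4, 4])

    # Reassign `a` so that later nodes consume the printing node.
    a = tf.Print(a, [a, b], message="Values for a and b: ", summarize=6)

    # Hypothetical downstream computation that depends on the Print node.
    total = tf.reduce_sum(a + b)

    with tf.Session() as sess:
        # Fetching `total` evaluates the Print node as a side effect;
        # the message goes to the standard error stream.
        result = sess.run(total)

print(result)
```

If the result of tf.Print() were discarded instead of assigned back to a, the node would not sit on the path to total and nothing would be printed.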
Asserting and checking values
The TensorFlow library offers the usual assert function as well as several functions that check for values common in numerical computation. These include tf.Assert and other similar assert functions.
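As an illustration of in-graph checks, the sketch below combines tf.Assert with tf.debugging.check_numerics, which flags NaN and infinity values. It assumes the tensorflow.compat.v1 module is available; the function checked_log and its inputs are hypothetical.

```python
# Assumption: tensorflow.compat.v1 is available; on older 1.x releases the
# same calls should exist under `import tensorflow as tf`.
import tensorflow.compat.v1 as tf


def checked_log(values):
    """Compute log(x) with two in-graph checks (hypothetical helper)."""
    with tf.Graph().as_default():
        x = tf.placeholder(tf.float32, shape=[None], name="x")

        # Fails at execution time if any element of x is negative.
        non_negative = tf.Assert(tf.reduce_all(x >= 0.0), [x],
                                 name="x_non_negative")

        # Make the log node wait for the assertion.
        with tf.control_dependencies([non_negative]):
            y = tf.math.log(x)

        # Fails at execution time if y contains NaN or infinity,
        # for example log(0) = -inf.
        y = tf.debugging.check_numerics(y, "log(x) is not finite")

        with tf.Session() as sess:
            return sess.run(y, feed_dict={x: values})


print(checked_log([1.0, 10.0]))
```

Both checks raise tf.errors.InvalidArgumentError from sess.run() when they fail, so the error surfaces at graph execution rather than at graph construction.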
The interactive debugger
For more general debugging where you want to look around in the execution environment for potential problems, tfdbg is an interactive debugger that provides access to the environment where the graph is executed. It works by wrapping the call to the runtime to capture the point of entry. For this reason, you will need to change your program to add the wrapper:
from tensorflow.python import debug as tf_debug
sess = tf.InteractiveSession()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
When a session.run() call is made to execute a graph, an interactive shell pops up to allow you to inspect the tensors and operations.
You can print the current values of a tensor, its shape, and its storage size. If your program is getting an out-of-memory error, it is helpful to check the storage sizes to pinpoint the problem. Tensors tend to have many dimensions and the shape is often derived from the input, so their size can grow in unexpected ways, causing memory problems.
Navigating large multidimensional tensors interactively is cumbersome. To help in this respect, you can register filters, which are functions that scan the tensors for particular conditions such as infinity or NaN values. For managing control flow, such as stepping or continuing, tfdbg currently provides only primitive support; future versions may add more. For more details, please see the TensorFlow Debugger (tfdbg) documentation.
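A filter is a callable that receives metadata about a dumped tensor together with its value as a NumPy array, and returns True when the tensor matches. The sketch below is a hypothetical custom filter written with NumPy only, so it can be tried without a debugging session; the name has_large_values and the threshold are illustrative assumptions.

```python
import numpy as np


def has_large_values(datum, tensor, threshold=1e6):
    """Return True if any finite element exceeds `threshold` in magnitude.

    `datum` carries metadata about the dumped tensor and is unused here;
    `tensor` is the tensor value as a NumPy array.
    """
    # Skip values that are not floating-point arrays (e.g. strings, ints).
    if not isinstance(tensor, np.ndarray):
        return False
    if not np.issubdtype(tensor.dtype, np.floating):
        return False
    finite = tensor[np.isfinite(tensor)]
    return bool(np.any(np.abs(finite) > threshold))


# Registration on the wrapped session from the snippet above would look like:
#   sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
#   sess.add_tensor_filter("has_large_values", has_large_values)
print(has_large_values(None, np.array([1.0, 2e7])))
```

Once registered, a filter can be used at the tfdbg prompt, for example with run -f has_inf_or_nan, to continue execution until a matching tensor appears.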
Interactive debugging is also possible through a recently announced feature, eager execution, that was motivated by PyTorch’s dynamic graph. TensorFlow’s graph is normally static; in other words, the graph must be fully created before it can be executed. In eager execution mode, the graph is constructed dynamically and evaluated immediately. The intention is to enable easy experimentation with your neural network, but a side benefit is that you can now use your Python interpreter to debug in the same way as with a normal Python program. Since this feature has not been included in release 1.4, to use the eager execution mode you will need to install the TensorFlow version from the master repository. You can clone and build from the latest master, or you can install directly from the nightly build, as follows:
sudo pip install tf-nightly
It may be good practice to install the nightly build in a separate environment to isolate this non-official version of TensorFlow, until the feature is included in an official release. Let’s try this out in a Python interpreter:
$ python
Python 2.7.10 (default, Jul 15 2017, 17:16:57)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> import tensorflow.contrib.eager as tfe
>>> tfe.enable_eager_execution()
>>> a = tf.random_uniform([4,4])
>>> a
<tf.Tensor: id=6, shape=(4, 4), dtype=float32, numpy=
array([[ 8.63772631e-02, 1.43377185e-01, 4.90468979e-01, 9.01059031e-01],
       [ 6.33775592e-01, 4.07707214e-01, 3.92198563e-04, 9.46395278e-01],
       [ 9.71757889e-01, 9.00770664e-01, 3.30629349e-01, 1.42947078e-01],
       [ 4.40206051e-01, 7.27719307e-01, 9.77952838e-01, 9.75356817e-01]], dtype=float32)>
You can see that in addition to the usual metadata for the tensor, the values of the tensor are now computed and available immediately without having to execute the full graph. Note that static and dynamic graphs are mutually exclusive, so you will have to make the choice at the beginning of the program. The default mode is static graph. Since the eager execution mode is in alpha release, it is not yet supported in many key TensorFlow functions such as distributed mode, TensorBoard, etc. As a result, this is currently useful to debug portions of your graph that involve only numerical computation. For more details, you can visit the TensorFlow Eager Execution page.
Logging is useful for debugging long-running training sessions or processes serving inference requests. TensorFlow supports the usual logging mechanism, with five levels, in order of increasing severity: DEBUG, INFO, WARN, ERROR, and FATAL.
Note that the logs are generated from the graph execution, which occurs in the runtime. Setting a particular log level will show all messages from that level and all levels more severe. You can set the log level in the program by calling tf.logging.set_verbosity(), for example tf.logging.set_verbosity(tf.logging.INFO).
Since the runtime is implemented in C++, you can also set the C++ environment variables:
export TF_CPP_MIN_VLOG_LEVEL=3
export TF_CPP_MIN_LOG_LEVEL=3
For the environment variables, the default value is 0, so all logs are shown. Set TF_CPP_MIN_LOG_LEVEL to 1 to filter out INFO logs and below, 2 to filter out WARN, 3 to filter out ERROR, etc. If TF_CPP_MIN_LOG_LEVEL is set, then TF_CPP_MIN_VLOG_LEVEL is ignored.
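The same variable can also be set from inside a Python program rather than in the shell. The sketch below uses only the standard library; the helper name quiet_tf_cpp_logs is hypothetical, and it assumes the variable needs to be in place before TensorFlow is imported so that the C++ runtime sees it.

```python
import os


def quiet_tf_cpp_logs(min_level=2):
    """Set TF_CPP_MIN_LOG_LEVEL for the current process (hypothetical helper).

    0 shows everything, 1 hides INFO, 2 also hides WARN, 3 also hides ERROR.
    """
    if min_level not in (0, 1, 2, 3):
        raise ValueError("min_level must be 0, 1, 2 or 3")
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = str(min_level)
    return os.environ["TF_CPP_MIN_LOG_LEVEL"]


quiet_tf_cpp_logs(2)

# Import TensorFlow only after the variable has been set, for example:
#   import tensorflow as tf
```

Setting the variable in code keeps the choice next to the program it affects, at the cost of having to run before any module that imports TensorFlow.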
There are also API calls to inject your own log messages from your program at the desired level, such as tf.logging.info() and tf.logging.error().
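A minimal sketch of these logging calls follows, assuming the tensorflow.compat.v1 module is available (on older 1.x releases the same functions should be reachable as tf.logging after a plain import). The message texts and values are made up for illustration.

```python
# Assumption: tensorflow.compat.v1 is available; on older 1.x releases,
# `import tensorflow as tf` should expose tf.logging directly.
import tensorflow.compat.v1 as tf

# Show INFO and everything more severe; DEBUG messages are suppressed.
tf.logging.set_verbosity(tf.logging.INFO)

step, loss = 100, 0.25  # illustrative values only

tf.logging.debug("Not shown at INFO verbosity")
tf.logging.info("Step %d: loss = %.3f", step, loss)
tf.logging.warn("Loss has not improved for %d steps", 10)
tf.logging.error("Checkpoint could not be written")
```

The calls accept printf-style format arguments, so the message is only formatted when the level is actually enabled.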
For more details, please visit the TensorFlow logging page.
When your program seems to run correctly but is not producing the expected result, you will need to debug at a higher level, and TensorBoard can be useful for this purpose. TensorBoard is a visualization tool for post-mortem analysis: you need to add calls in your program to generate data and write to an event file. First you need to create the event file:
writer = tf.summary.FileWriter('./tensorflow_logs/mnist_deep', sess.graph)
As you generate data, you can push to the file by:
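merged = tf.summary.merge_all()   # build once, after all summary operations are defined
writer.add_summary(summary, i)    # summary is the value of merged returned by sess.run() at step i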
Remember to close the file handle before exiting your program by calling writer.close().
Please refer to the tf.summary module for the complete API for working with TensorBoard data. After your program has completed, you can run TensorBoard against this data:
tensorboard --logdir=./tensorflow_logs/mnist_deep
TensorBoard runs as a web server, so you can access it in the browser using the link provided. The API supports simple graphs and histograms of any tensor, for example:
tf.summary.scalar('loss', cross_entropy)
tf.summary.histogram('softmax', y)
The API also supports audio and image data, allowing you to verify the input for training or the transformed data within the neural network:
tf.summary.image('input', x_image, 4)
For instance, you can display the images after convolution.
Viewing the graph that implements your neural network is useful for spotting errors in the implementation. To make the graph more readable, add names to your tensors and operations:
x = tf.placeholder(tf.float32, [None, 784], name='x')
with tf.name_scope('optimizer'):
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy, name='train_step')
The names will be used to display the graph for your neural network.
There is also support for advanced visualization of the clustering behavior in your tensors. For more details, please visit the TensorBoard page.
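The individual TensorBoard calls in this section can be put together as in the sketch below. It assumes the tensorflow.compat.v1 module is available; the toy loss, the function write_demo_summaries, and the temporary log directory are illustrative assumptions rather than part of the MNIST example the snippets above come from.

```python
import glob
import os
import tempfile

# Assumption: tensorflow.compat.v1 is available; on older 1.x releases,
# `import tensorflow as tf` should expose the same API.
import tensorflow.compat.v1 as tf


def write_demo_summaries(logdir, steps=5):
    """Write a few scalar summaries to an event file (hypothetical helper)."""
    with tf.Graph().as_default():
        x = tf.placeholder(tf.float32, [], name="x")
        with tf.name_scope("demo"):
            loss = tf.square(x - 3.0, name="loss")  # toy stand-in for a loss

        tf.summary.scalar("loss", loss)
        merged = tf.summary.merge_all()  # one op that evaluates all summaries

        with tf.Session() as sess:
            # Create the event file and record the graph for the graph view.
            writer = tf.summary.FileWriter(logdir, sess.graph)
            for i in range(steps):
                summary = sess.run(merged, feed_dict={x: float(i)})
                writer.add_summary(summary, i)  # i is the step on the x-axis
            writer.close()  # flush and release the file handle

    return glob.glob(os.path.join(logdir, "events.out.tfevents.*"))


logdir = tempfile.mkdtemp(prefix="tensorflow_logs_")
event_files = write_demo_summaries(logdir)
print(event_files)
```

Pointing tensorboard --logdir at the printed directory should then show the loss curve and the named graph.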
In this tutorial, we have seen how TensorFlow programs are different from your typical procedural programs and why debugging them requires some adjustment. It is helpful to be able to reuse the debugging practices we are familiar with as well as a few newer techniques.