Data preprocessing is an integral part of any neural network and is often complex and expensive. Traditionally, these operations are carried out on the CPU, which creates a bottleneck in systems with higher GPU to CPU ratios. This limits the performance of training and inference due to the compute-intensive nature of traditional preprocessing operations. Additionally, the various deep learning frameworks have unique implementations of multiple preprocessing pipelines, making portability of data between frameworks difficult.

Why do we need to preprocess data?

Often, the data we want to feed to a model is unstructured, inconsistent, and incomplete. Using this raw data in a model as-is would provide uninteresting and likely error-ridden results. As the old saying goes, “garbage in, garbage out.” The solution to this is to preprocess the data into something clean and structured so the model can then use it to derive pertinent information. This could include standardizing the data using transformations, reformatting it into something the model understands, or simply padding out the data to effectively create more information. Preprocessing your data helps your model achieve better results and is thus an essential piece of any deep learning model. The abstraction used in many AI frameworks to handle these operations is called a pipeline.


Pipelining is the technique of distributing operations between multiple workers to reduce overall execution time. We will illustrate this concept using a simple example of a system with a single CPU and single GPU. A simple pipeline implementation may preprocess the data on the CPU and then pass it to the GPU serially as is shown below.

As you can see, the GPU is stalled while waiting for the next batch to make it through the preprocessing stage, and the CPU is stalled while waiting for the GPU to train on the current batch. This wastes computing power: the time to complete a batch is the sum of the time it takes to preprocess the data on the CPU and the time it takes the model to compute on the GPU.

A more robust pipeline implementation would use multiple threads. That is, one would be preprocessing the “next” batch on the CPU while another would be finishing the “current” batch on the GPU.
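As a rough illustration of this two-thread arrangement, the sketch below uses Python's standard queue and threading modules. The preprocess and train_step functions are hypothetical stand-ins for the real CPU and GPU work, not part of any framework, and prefetch_depth is an assumed tuning knob.

```python
import queue
import threading

def preprocess(batch):
    # Hypothetical stand-in for CPU preprocessing: scale every value.
    return [x * 2 for x in batch]

def train_step(batch):
    # Hypothetical stand-in for the GPU compute step: reduce the batch.
    return sum(batch)

def run_overlapped(raw_batches, prefetch_depth=2):
    """Preprocess upcoming batches on a worker thread while the
    main thread consumes the batches that are already prepared."""
    ready = queue.Queue(maxsize=prefetch_depth)
    done = object()  # sentinel marking the end of the input

    def producer():
        for raw in raw_batches:
            ready.put(preprocess(raw))  # blocks while the queue is full
        ready.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        batch = ready.get()  # blocks until a prepared batch is available
        if batch is done:
            break
        results.append(train_step(batch))
    worker.join()
    return results

print(run_overlapped([[1, 2], [3, 4], [5, 6]]))  # [6, 14, 22]
```

The bounded queue keeps the producer from running arbitrarily far ahead of the consumer. In a real training loop the consumer would launch GPU work, which is where the overlap pays off; with pure-Python stand-ins like these, the threads mostly take turns.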

The time to complete a batch in this pipeline becomes the maximum of the total time spent preparing the data on the CPU and the total time spent computing the model on the GPU. This parallel pipeline can be further improved on systems with multiple GPUs, although careful orchestration of the work is required.

There are many other factors to consider as you improve your pipeline, including transfer time, number of workers, number of threads, and so on, but the main idea of overlapping processes to hide latency remains the same.
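The difference between the two schedules can be estimated with a simplified timing model. The sketch below assumes constant per-batch times and ignores transfer time and scheduling overhead; the function names and the sample timings are made up for illustration.

```python
def serial_time(num_batches, t_cpu, t_gpu):
    # Each batch is preprocessed and then trained, with nothing overlapping.
    return num_batches * (t_cpu + t_gpu)

def overlapped_time(num_batches, t_cpu, t_gpu):
    # The first batch must be preprocessed before the GPU can start and the
    # last batch still needs its GPU step at the end; in between, each
    # step costs only the slower of the two stages.
    if num_batches == 0:
        return 0
    return t_cpu + (num_batches - 1) * max(t_cpu, t_gpu) + t_gpu

# Hypothetical timings in milliseconds: 30 ms to preprocess, 20 ms to train.
print(serial_time(100, 30, 20))      # 5000
print(overlapped_time(100, 30, 20))  # 3020
```

Under these assumptions the overlapped schedule is bounded by the slower stage, which is why moving preprocessing work off a saturated CPU can shorten the overall run.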

The DALI Pipeline

NVIDIA Data Loading Library (DALI) is an implementation of a data preprocessing pipeline that allows these expensive operations to be outsourced to the GPUs. This single library can be integrated into different deep learning frameworks, making it easy to run models on the same sets of data in different frameworks. In summary, DALI provides both flexibility and performance to the preprocessing step of running a model.

DALI accomplishes this by using its native Pipeline type. This object is responsible for handling all the aforementioned operations to prepare data – including moving the data from the CPU to the GPU. To demonstrate how this construct works, let’s look at a simple pipeline implementation. The following example uses DALI in conjunction with the popular AI framework, TensorFlow, and operates on the ImageNet data set.

Prerequisites and Installation

DALI is provided in the form of a Conda package in IBM Watson Machine Learning Community Edition (WML CE) 1.6.1. Installation and usage are quick and easy, allowing you to get the most out of your models.

Prerequisites – Ensure that the following are installed and set up:

  1. Anaconda installed and the WML CE Conda channel added.
  2. NVIDIA GPU driver installed (Required if training with GPUs, optional otherwise).

Check the WML CE 1.6.1 Knowledge Center for details on installing the above.

Install DALI and TensorFlow
Run the following commands to install DALI and TensorFlow:

  1. conda install dali
  2. conda install tensorflow-gpu

Before discussing the Pipeline itself, there is a bit of setup that needs to be done when working with TFRecords. For each TFRecord that will be input to the pipeline, a corresponding “index” file needs to be generated. These index files are used by DALI to properly share the data between multiple workers such as when running in a multi-GPU environment. Thankfully, DALI provides us with a script to convert TFRecords to the index files via the tfrecord2idx utility:

from subprocess import call
import os.path

tfrecord = "/path/to/imagenet/train-00001-of-01024"
tfrecord_idx = "index_files/train-00001-of-01024.idx"
tfrecord2idx = "tfrecord2idx"

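# Create the directory for the index files if it does not exist yet
if not os.path.exists("index_files"):
    os.mkdir("index_files")

# Generate the index file only if it is not already present
if not os.path.isfile(tfrecord_idx):
    call([tfrecord2idx, tfrecord, tfrecord_idx])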

That is an example of a single index file being generated, but it can easily be extended to iterate over a list of TFRecords. The TFRecordReader call below can accept single files as well as a list of them.
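One way to extend this to several TFRecords is sketched below. The helper names index_path_for and build_indexes are hypothetical, and the sketch assumes the tfrecord2idx utility is on your PATH and that each index file is named after its TFRecord with an .idx suffix, as in the single-file example.

```python
import os
from subprocess import call

def index_path_for(tfrecord_path, index_dir="index_files"):
    # Name the index file after the TFRecord, with an .idx suffix.
    return os.path.join(index_dir, os.path.basename(tfrecord_path) + ".idx")

def build_indexes(tfrecord_paths, index_dir="index_files", tool="tfrecord2idx"):
    """Generate any missing index files and return all index paths,
    in the same order as the TFRecords they belong to."""
    os.makedirs(index_dir, exist_ok=True)
    index_paths = []
    for tfrecord_path in tfrecord_paths:
        idx = index_path_for(tfrecord_path, index_dir)
        if not os.path.isfile(idx):
            call([tool, tfrecord_path, idx])  # runs the external utility
        index_paths.append(idx)
    return index_paths

# Example usage (paths are placeholders):
# tfrecords = ["/path/to/imagenet/train-00001-of-01024",
#              "/path/to/imagenet/train-00002-of-01024"]
# tfrecord_idxs = build_indexes(tfrecords)
```

The two lists can then be passed as the path and index_path arguments of the reader in place of the single file names.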

Now let’s take a look at the Pipeline itself:

from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
import nvidia.dali.tfrecord as tfrec

# BasicPipeline to be a derived class of DALI's native Pipeline type
class BasicPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(BasicPipeline, self).__init__(batch_size, num_threads, device_id)

        self.input = ops.TFRecordReader(path = tfrecord,
                                        index_path = tfrecord_idx,
                                        features = {"image/encoded": tfrec.FixedLenFeature([], tfrec.string, ""),
                                            "image/class/label": tfrec.FixedLenFeature([1], tfrec.float32, 0.0),
                                            "image/class/text": tfrec.FixedLenFeature([], tfrec.string, ""),
                                            "image/object/bbox/xmin": tfrec.VarLenFeature(tfrec.float32, 0.0),
                                            "image/object/bbox/ymin": tfrec.VarLenFeature(tfrec.float32, 0.0),
                                            "image/object/bbox/xmax": tfrec.VarLenFeature(tfrec.float32, 0.0),
                                            "image/object/bbox/ymax": tfrec.VarLenFeature(tfrec.float32, 0.0)})

        self.decode = ops.nvJPEGDecoder(device = "mixed",
                                        output_type = types.RGB)

        self.resize = ops.Resize(device = "gpu",
                                 image_type = types.RGB,
                                 interp_type = types.INTERP_LINEAR,
                                 resize_shorter = 256.)

    # Define how the operations above will be used in the pipeline
    def define_graph(self):
        inputs = self.input()
        images = self.decode(inputs["image/encoded"])

        resized_images = self.resize(images)

        labels = inputs["image/class/label"]

        return (resized_images, labels)

There are several things to unpack here. First, it is important to note that our new BasicPipeline class has to be a subclass of dali.pipeline.Pipeline. This ensures that it inherits all of the functionality needed to create and launch the pipeline. The only methods that need to be overridden are the constructor and the define_graph function.

In the constructor, we first call the superclass constructor to set some parameters of the pipeline. We then define the operations, taken from the dali.ops module, that the pipeline will be able to perform on its data. In our simple example, the pipeline only reads the records, decodes the images, and resizes them, but DALI supports several other transformation operations. The define_graph function is where we describe the order in which we are going to perform these operations. As shown in the example above, we first read in the data via the TFRecordReader, then decode the images, and finally resize them. Note each operation’s device parameter, which determines where the operation runs; the values available depend on what the operation supports.

There are three main options to select from:

  • CPU – The operation takes place on the CPU
  • GPU – The operation takes place on the GPU
  • Mixed – The operation accepts input on the CPU and produces output on the GPU

Standalone Pipeline

Now that we have defined how the pipeline should work, we can build and run the pipeline by itself. This is done simply with:

# batch_size = 10, num_threads = 4, device_id = 0
pipe = BasicPipeline(10, 4, 0)
pipe.build()

output = pipe.run()
print(output)

This gives us the following output, corresponding to the image and label tensors:

[<nvidia.dali.backend_impl.TensorListGPU object at 0x7ffef4a5d6f8>, <nvidia.dali.backend_impl.TensorListCPU object at 0x7ffef4a5d730>]

Notice that the first element, the images, is a TensorList object on the GPU while the second element, the labels, is on the CPU. This is because in define_graph we never applied any operations to the labels that would put them on the GPU. An easy way to move the labels to the GPU is to return labels.gpu() at the end of define_graph. These returned elements have several useful debugging features, including is_dense_tensor() and as_array(), that give some visibility into what your pipeline is doing before the data is sent to a model.

Using DALI with TensorFlow

DALI provides integration with TensorFlow in the form of a plugin. This plugin is implemented as a TensorFlow custom operator. It allows the pipeline to return data as TensorFlow Tensor objects, making it easy to plug it into your model. To start, we will import TensorFlow and the DALI TensorFlow plugin:

import tensorflow as tf
import nvidia.dali.plugin.tf as dali_tf

We can now use the dali_tf.DALIIterator() method to get the TensorFlow operator that will give us the Tensors to feed to the model:

daliop = dali_tf.DALIIterator()

with tf.device("/gpu:0"):
    img, label = daliop(pipeline = pipe,
                        shapes = [(32, 3, 224, 224), ()],
                        dtypes = [tf.float32, tf.int32])

img and label now contain TensorFlow Tensor objects that can be fed directly into your model. The tuple returned by daliop is compatible with tf.Session() and Estimators. However, in the latter case, the above code needs to be wrapped in a function to be passed as the input_fn parameter to the Estimator.

Performance considerations when training with GPUs

While DALI can provide a lot of value in terms of speeding up model training, it does come with some caveats. In systems with multiple GPUs, special care needs to be taken to make sure the pipeline is executing on the GPU you expect, and that doing so does not hurt performance through unnecessary data transfers between devices. The benefit of using the DALI pipeline is that the data is moved ahead of time onto the same GPU you want to train on. It is possible to preprocess your data on one GPU and then send it to another for training. However, you will take a performance hit when transferring the data between GPUs. To avoid this, make sure your pipeline produces the image and label tensors within the context of with tf.device("/gpu:X"), where ‘X’ is the same GPU ID you’ve specified your model to run on. That said, preprocessing on a separate GPU could be ideal if you have the resources to dedicate an entire GPU to preprocessing while the others run the model in parallel, assuming the time to transfer the data is small enough to make it feasible.

Furthermore, working with a single GPU for both preprocessing and training has its own potential pitfalls. Once the pipeline has produced your tensors, make sure that no CPU-bound operations are applied to them before the model has a chance to run. Otherwise, there is a risk that the data will be copied back to the CPU after it has already been processed on the GPU.


To summarize, DALI is a useful and straightforward way to preprocess your data without having to do a deep dive into framework-specific APIs that can quickly become complicated and difficult to understand. It is a simple matter of declaring your pipeline, defining what operations you want to perform on your data, and then passing that data to your model for training. DALI provides ease of use, performance, and flexibility. We are proud to provide access to this software as a part of WML CE 1.6.1.
