Skill Level: Advanced

Data pipelines are generally very complex and difficult to test. Unfortunately, there are not many well-documented strategies or best practices for testing data pipelines. In this recipe, we'll present a high-level guide to testing your data pipelines.


Most data pipelines, batch as well as streaming, are modeled as directed acyclic graph (DAG) workflows and are built on distributed big data frameworks such as Apache Hadoop, Apache Spark, or Apache Storm. This graph-based structure and distributed nature make testing data pipelines much harder than testing conventional applications. These pipelines typically handle a large volume of data ingested at high velocity, which demands rigorous non-functional testing to characterize performance, load handling, and fault tolerance, in addition to normal functional tests. Although data pipelines are replacing classic ETL, the concepts of ETL testing, such as problem resolution and data preparation, are still relevant. For instance, fault tolerance, or the ability to handle "dirty data" that violates business rules, still needs to be tested.


Anatomy of data pipelines

Before we dive deep into the various strategies to test your data pipeline, it is worth covering the anatomy of a typical data pipeline. A typical data pipeline ingests data from various data sources (data ingress), then processes the data using a pipeline or workflow, and finally redirects the processed data to appropriate destinations (data egress).

Directed acyclic graph

As mentioned before, a data pipeline or workflow is best described as a directed acyclic graph (DAG). A graph consists of a set of vertices, or nodes, connected by edges. When the edges are directed from one node to another, the graph is called a directed graph. A directed acyclic graph is a directed graph that contains no cycles. Based on these definitions, a data pipeline is a sequence of jobs (aka tasks) connected by data inputs and outputs. In a stream processing model, jobs can be modeled as the nodes of the graph and data streams as the edges. In contrast, a batch processing model has jobs as edges and data files as nodes. A sub-DAG is simply a subgraph of the pipeline DAG.




Figure-1: An example data processing pipeline - directed acyclic graph with nodes and edges.
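This DAG structure can be sketched directly in code. The following is a minimal illustration (the job names are hypothetical) using Python's standard-library `graphlib`, which maps each job to the set of jobs it depends on and derives a valid execution order; it also raises an error if the graph accidentally contains a cycle:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline: jobs as nodes, data dependencies as edges.
# Each key maps to the set of jobs that must finish before it runs.
pipeline = {
    "ingest": set(),
    "clean": {"ingest"},
    "enrich": {"clean"},
    "aggregate": {"clean"},
    "publish": {"enrich", "aggregate"},
}

# static_order() yields a valid execution order; it raises CycleError
# if the graph is not acyclic, which is a cheap structural sanity check.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Knowing the topological order is also useful for testing: it tells you which sub-DAGs can be exercised independently before wiring up an end-to-end test.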


Test input, output, and oracle

A test case is a set of test inputs, the corresponding expected outputs, and a set of fixtures describing pretest conditions. A test suite is a group of related test cases. Test cases are then exercised on a system under test (SUT). During test execution, under the given test conditions, the system under test maps the set of inputs to a set of outputs. A test oracle then compares these outputs with the expected outputs described in the test case to determine whether or not the test passed.
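In code, this vocabulary reduces to a few lines. The sketch below uses a hypothetical single-record job as the system under test; the final assertion plays the role of the oracle:

```python
# Hypothetical system under test: a job that normalizes a country code.
def system_under_test(record):
    return {**record, "country": record["country"].upper()}

# Test input and the expected output it should map to.
test_input = {"id": 1, "country": "de"}
expected_output = {"id": 1, "country": "DE"}

# The oracle: compare actual output against the expectation.
actual_output = system_under_test(test_input)
assert actual_output == expected_output
```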




  1. Identify tests and their scope

    There are various types of functional tests we can exercise on data pipelines. To determine what type of test we should or could run, we first have to define the system under test, i.e., the scope and scale of functional testing. Normally, one should make any decision about scope based on the following key considerations: time to set up, time to feedback, time to develop, maintenance overhead, coverage, reliability, and overall cost.

    Unit tests: Typically, a class is the system under test. Unit tests provide the fastest feedback. They are quick to develop and easy to set up and maintain. Unit tests are considered a best practice, so we should write them anyway. They are generally reliable but not sufficient to form a final view on the quality of a data pipeline.
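    A unit test for a pipeline usually targets one small transformation in isolation. The sketch below tests a hypothetical line-parsing function in a pytest-style layout (plain functions with bare assertions):

```python
# Hypothetical transformation under test: parse one raw CSV line into a record.
def parse_line(line: str) -> dict:
    user_id, amount = line.strip().split(",")
    return {"user_id": user_id, "amount": float(amount)}

# Unit tests: one small function as the system under test, fast feedback.
def test_parse_line_happy_path():
    assert parse_line("u1,9.99\n") == {"user_id": "u1", "amount": 9.99}

def test_parse_line_strips_whitespace():
    assert parse_line(" u2,0.5 ") == {"user_id": "u2", "amount": 0.5}
```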

    Component tests: Typically, a job is the system under test. Testing an individual job is no different from a unit test, and the overall characteristics are similar. But testing a large number of jobs is a totally different beast: depending on the overall number of jobs, achieving test coverage for all of them can be slow.

    Integration tests: You can start with two sequential jobs as the system under test and then expand the scope to include more jobs. As you expand the scope of an integration test, it gets harder to achieve code coverage, maintaining tests becomes a nightmare, and environment setup gets equally hard.

    End-to-end tests: An end-to-end test is a special variation of an integration test that includes all or most of the jobs, with the pipeline DAG or a sub-DAG as the system under test. For an end-to-end test, the feedback loop is slow, maintenance overhead is high, developing test coverage requires a lot more work, and the test environment is very difficult to set up. Yet an end-to-end test is essential for measuring end-to-end performance and fault-tolerance characteristics. This type of test may involve multiple teams spread across the organization, which makes it more challenging from a collaboration and cultural perspective.

    Black-box tests: Quite similar to an end-to-end test. The pipeline is the system under test, but in black-box mode, i.e., we don't care what happens inside the pipeline; we are only interested in its inputs and outputs. The overall characteristics of a black-box test are similar to those of an end-to-end test, except that black-box tests are faster to develop and easier to maintain.
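    A black-box test collapses to "feed ingress, assert on egress." The sketch below uses a hypothetical `run_pipeline` stand-in for invoking the real pipeline; only the ingress/egress contract is asserted, never the internal jobs:

```python
# Stand-in for invoking the whole pipeline end to end (black box).
# In a real test this would submit input to the pipeline and collect output.
def run_pipeline(records):
    cleaned = [r for r in records if r.get("amount") is not None]
    return [{**r, "amount": round(r["amount"], 2)} for r in cleaned]

# Assert only on ingress and egress, ignoring how the pipeline works inside.
ingress = [{"id": 1, "amount": 3.14159}, {"id": 2, "amount": None}]
egress = run_pipeline(ingress)
assert egress == [{"id": 1, "amount": 3.14}]
```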



    Figure-2: Varying levels of scope and scale for functional testing of a data pipeline.


    If you are developing a data pipeline from scratch, you will probably start by writing a test for one single job and slowly add tests for more jobs in the pipeline, iteratively and incrementally, in an agile fashion. Writing a test for a job is relatively easy; the heavy lifting is in setting up a test environment and identifying the test inputs and the test oracle.

  2. Set up the test environment

    As mentioned earlier, setting up the test environment is often the most challenging part of testing your data pipelines. Luckily, nowadays most test environments can be containerized and can technically run on a local machine. But if your system under test depends on cloud-based services, setting up the test environment can be a bit tricky, for instance, a component integrated with Amazon S3 or Amazon DynamoDB. This can be partially mitigated by running local versions of these services, but the overall characteristics of these local clones may not be the same as the cloud versions. This type of environment inconsistency can lead to regression defects.

  3. Generate dynamic test inputs

    Classical data preparation approaches rely on data profiling and data sampling. The end product of data preparation is a file (JSON, XML, or CSV) with a set of test inputs and possibly the corresponding expected outputs. As a best practice, we should generally avoid static test inputs: they are hard to maintain, and we can easily replace them with easy-to-use factories for complex objects. Instead, we can use language-specific packages like Python Faker or Ruby Faker to write data scripts. These data scripts can generate test inputs for each test run on the fly, using a set of rules and properties.
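    A minimal data script can look like the sketch below. To keep the example self-contained it uses the standard-library `random` module in place of Faker (the field names and rules are hypothetical); in practice you would swap in Faker providers for richer, more realistic values. Seeding the generator keeps each test run reproducible while still giving high-variance inputs:

```python
import random
import string

# One synthetic record, built from simple rules. With Faker you would use
# providers such as fake.user_name() or fake.ipv4() instead.
def make_record(rng: random.Random) -> dict:
    return {
        "user_id": "".join(rng.choices(string.ascii_lowercase, k=8)),
        "ip": ".".join(str(rng.randint(0, 255)) for _ in range(4)),
        "amount": round(rng.uniform(0, 1000), 2),
    }

# A seeded generator makes the test input reproducible across runs.
def make_test_input(n: int, seed: int = 42) -> list:
    rng = random.Random(seed)
    return [make_record(rng) for _ in range(n)]

records = make_test_input(1000)
```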

    There are several benefits to data scripts. First, data scripts can generate large test inputs with high variance, which helps expose defects, including edge cases. For example, using data scripts you can easily generate a test input data set of 1,000 or 10,000 records. Second, scripts are less susceptible to change, require less maintenance, and can be reused.

  4. Define test oracle for your tests

    When defining a test oracle, we should avoid full-record comparison and compare only the attributes relevant to the test. Apart from comparing values, a test oracle can validate business rules, i.e., minimum/maximum values, the validity of data types, and valid attribute values based on certain constraints. The test oracle should also include validation logic for scenarios like duplicate records (exactly-once processing).

    It is possible to generate the expected output data along with the test input. In this particular scenario, the test oracle is only responsible for comparing the necessary attributes of the expected output with the output produced by the SUT. For instance, say we are writing a component test for a job that enriches every record with a location attribute based on the IP address information encoded in each record. Rather than comparing whole records, we should compare only the location attribute of the output and the expected output.
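    An attribute-scoped oracle for that IP-enrichment example might look like the sketch below (the record shape and attribute names are hypothetical). Note that the oracle ignores incidental fields such as timestamps that would make a full-record comparison brittle:

```python
# Oracle that compares only the attributes relevant to the test.
def oracle(actual_records, expected_records, attrs=("location",)):
    assert len(actual_records) == len(expected_records), "record count mismatch"
    for actual, expected in zip(actual_records, expected_records):
        for attr in attrs:
            assert actual[attr] == expected[attr], f"{attr} mismatch"

# The SUT's output carries extra fields; the oracle ignores them.
actual = [{"id": 1, "ip": "203.0.113.7", "location": "AU", "ingested_at": "..."}]
expected = [{"id": 1, "location": "AU"}]
oracle(actual, expected)  # passes: only 'location' is compared
```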

  5. Continuous integration of tests

    Finally, all tests should be continuously integrated using a continuous integration (CI) system such as Jenkins. Ideally, the CI system will use containers to create ephemeral test environments that can be torn down after the tests finish. Unit and component tests can be executed as soon as code for a job is committed to the version control repository, whereas slower tests such as end-to-end tests can run as part of a nightly build.
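    As a rough illustration, a declarative Jenkinsfile for this split might look like the following config sketch (the stage names, test directories, and commands are assumptions about your project layout):

```groovy
// Hedged Jenkinsfile sketch: fast tests on every commit,
// end-to-end tests only when triggered by the nightly timer.
pipeline {
    agent any
    stages {
        stage('Unit and component tests') {
            steps { sh 'pytest tests/unit tests/component' }
        }
        stage('End-to-end tests') {
            when { triggeredBy 'TimerTrigger' }  // nightly cron build only
            steps { sh 'pytest tests/e2e' }
        }
    }
}
```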
