Identify tests and their scope
There are various types of functional tests we can exercise on data pipelines. To determine which type of test we should or could run, we first have to define the system under test, i.e. the scope and scale of functional testing. Decisions about scope should normally be based on the following key considerations: time to set up, time to feedback, time to develop, maintenance overhead, coverage, reliability, and overall cost.
Unit tests: Typically a single class is the system under test. Unit tests provide the fastest feedback; they are quick to develop and easy to set up and maintain. Unit tests are considered a best practice, so we should write them anyway. They are generally reliable but not sufficient to form a final view of the quality of a data pipeline.
Component tests: Typically a single job is the system under test. Testing an individual job is not much different from a unit test, and the overall characteristics are similar. Testing a large number of jobs, however, is a totally different beast: depending on how many jobs there are, building test coverage for all of them can be slow.
Integration tests: You can start with two sequential jobs as the system under test and then expand the scope to include more jobs. As you expand the scope of an integration test, achieving code coverage gets harder, maintaining tests becomes a nightmare, and environment setup becomes equally difficult.
End-to-end tests: An end-to-end test is a special variation of an integration test that includes all or most of the jobs (the pipeline DAG or a sub-DAG) as the system under test. For an end-to-end test, the feedback loop is slow, maintenance overhead is higher, developing test coverage requires much more work, and the test environment is very difficult to set up. Yet an end-to-end test is quintessential for measuring end-to-end performance and fault-tolerance characteristics. This type of test may involve multiple teams spread across the organization, which makes it more challenging from a collaboration and cultural perspective.
Black-box tests: Quite similar to end-to-end tests. The pipeline is the system under test, but in black-box mode, i.e. we do not really care what happens inside the pipeline; we are only interested in the input and output. The overall characteristics of a black-box test are similar to those of an end-to-end test, except that black-box tests are fast to develop and easy to maintain.
Figure 2: Varying levels of scope and scale for functional testing of a data pipeline.
If you are developing a data pipeline from scratch, you will probably start by writing a test for one single job and then gradually add tests for the other jobs in the pipeline, iteratively and incrementally in an agile fashion. Writing a test for a job is relatively easy; the heavy lifting lies in setting up a test environment and identifying test inputs and a test oracle.
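To make this concrete, here is a minimal pytest-style sketch of a test for a single job's transform. The job function `enrich_record` and its attributes are hypothetical, invented for illustration, not taken from any real pipeline.

```python
# Hypothetical job transform: adds a derived `full_name` attribute to a record.
def enrich_record(record: dict) -> dict:
    return {**record, "full_name": f"{record['first']} {record['last']}"}

# A unit/component test for the transform in isolation.
def test_enrich_record_adds_full_name():
    record = {"first": "Ada", "last": "Lovelace"}
    result = enrich_record(record)
    assert result["full_name"] == "Ada Lovelace"
    # Original attributes must pass through unchanged.
    assert result["first"] == "Ada"
    assert result["last"] == "Lovelace"

test_enrich_record_adds_full_name()
```

A test runner such as pytest would discover and run `test_enrich_record_adds_full_name` automatically; the explicit call at the end is only there so the sketch is self-contained.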
Setting up the test environment
As mentioned earlier, setting up the test environment is often the most challenging part of testing data pipelines. Luckily, most test environments can nowadays be containerized and run on a local machine. But if your system under test depends on cloud-based services, setting up the test environment can be a bit tricky, for instance a component integrated with Amazon S3 or Amazon DynamoDB. This can be partially mitigated by running a local version of these services, but the overall characteristics of these local clones may not match the cloud versions exactly. This type of environment inconsistency can lead to regression defects.
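The idea behind a local clone can be sketched as an in-memory stand-in for an object store such as Amazon S3. This is a deliberately minimal test double with an invented interface; real projects usually reach for tools like LocalStack or moto instead, which emulate the actual AWS APIs far more faithfully.

```python
# A minimal in-memory stand-in for a cloud object store, keeping tests
# hermetic and fast. The class and method names are illustrative only.
class FakeObjectStore:
    def __init__(self):
        self._buckets = {}

    def create_bucket(self, bucket: str) -> None:
        self._buckets.setdefault(bucket, {})

    def put_object(self, bucket: str, key: str, body: bytes) -> None:
        self._buckets[bucket][key] = body

    def get_object(self, bucket: str, key: str) -> bytes:
        return self._buckets[bucket][key]

# Usage in a test: seed input data, run the job against the fake store.
store = FakeObjectStore()
store.create_bucket("pipeline-input")
store.put_object("pipeline-input", "2024/01/records.json", b'{"id": 1}')
assert store.get_object("pipeline-input", "2024/01/records.json") == b'{"id": 1}'
```

The trade-off mentioned above applies here in the extreme: a hand-rolled double like this shares almost none of the failure modes of the real service, which is exactly the kind of inconsistency that lets regression defects slip through.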
Generate dynamic test inputs
Classical data-preparation approaches rely on data profiling and data sampling. The end product of data preparation is a file (JSON, XML or CSV) containing a set of test inputs and possibly the corresponding expected output. As a best practice, we should generally avoid static test inputs: they are hard to maintain and can easily be replaced with easy-to-use factories for complex objects. Instead, we can use language-specific packages like Python Faker or Ruby Faker to write data scripts. These data scripts can generate test input for each test run on the fly, using a set of rules and properties.
Data scripts have several benefits. First, they can generate large test inputs with high variance, which helps expose defects, including edge cases. For example, with a data script you can easily generate a test input of 1,000 or 10,000 records. Second, scripts are less susceptible to change, require less maintenance, and can be reused.
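A data script along these lines can be sketched with nothing but the standard library; a package like Python Faker provides richer, locale-aware generators for names, addresses, and so on. The field names and value ranges here are illustrative assumptions, not taken from any real schema.

```python
import random

def generate_records(n: int, seed: int = 42) -> list[dict]:
    """Generate n synthetic input records on the fly for a test run."""
    rng = random.Random(seed)  # seeded so each test run is reproducible
    countries = ["DE", "US", "IN", "BR"]
    return [
        {
            "user_id": rng.randint(1, 1_000_000),
            "country": rng.choice(countries),
            "purchase_amount": round(rng.uniform(0.0, 500.0), 2),
        }
        for _ in range(n)
    ]

records = generate_records(1000)
assert len(records) == 1000
assert all(r["country"] in {"DE", "US", "IN", "BR"} for r in records)
```

Seeding the generator is a deliberate choice: it keeps the high-variance benefit while making a failing run reproducible, which static files give you for free but ad-hoc randomness does not.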
Define test oracle for your tests
When defining a test oracle, we should avoid full-record comparison and compare only the attributes relevant to the test. Apart from comparing values, a test oracle can validate business rules, e.g. minimum/maximum values, data-type validity, and valid attribute values under certain constraints. The test oracle should also include validation logic for scenarios like duplicate records (exactly-once processing).
It is possible to generate the expected output along with the test input. In that scenario, the test oracle is only responsible for comparing the necessary attributes of the expected output with the output produced by the SUT. For instance, suppose we are writing a component test for a job that enriches every record with a location attribute based on the IP address information encoded in each record. Rather than comparing whole records, we should compare only the location attribute of the output and the expected output.
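The IP-geolocation example above can be sketched as a narrow oracle that compares only the attribute the job is responsible for. The record layout and the `location` attribute are illustrative assumptions.

```python
def assert_location_enriched(expected: list[dict], actual: list[dict]) -> None:
    """Oracle: compare only the `location` attribute, not the full record."""
    assert len(expected) == len(actual), "record counts differ"
    for exp, act in zip(expected, actual):
        # Other attributes (ip, timestamps, etc.) are deliberately ignored,
        # so unrelated changes to the record do not break this test.
        assert act["location"] == exp["location"], (
            f"location mismatch for record {exp.get('id')}"
        )

expected = [{"id": 1, "ip": "203.0.113.7", "location": "Berlin"}]
actual = [{"id": 1, "ip": "203.0.113.7", "location": "Berlin", "extra": "x"}]
assert_location_enriched(expected, actual)  # passes despite the extra attribute
```

Note that the extra attribute in the actual output does not fail the test; a full-record comparison would have, for a reason irrelevant to what this job does.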
Continuous integration of tests
Finally, all tests should be continuously integrated using a continuous integration (CI) system such as Jenkins. Ideally, the CI system will use containers to create an ephemeral test environment that can be torn down after the test. Unit and component tests can be executed as soon as the code for a job is committed to the version control repository, whereas tests like end-to-end tests can run as part of the nightly build.