Article

Architecture overview of Data Prep Kit (DPK)

Understanding the modular architecture of Data Prep Kit

By

Aanchal Goyal,

David Wood

Data Prep Kit (DPK) is a versatile framework that developers can use to streamline data processing for AI applications. Its modular architecture enables developers to rapidly create custom data transforms and easily deploy them to process data. DPK is designed to be data agnostic, so it can support use cases across natural language and code data modalities. Moreover, developers can add new data preparation modules to satisfy their domain-specific data processing needs.

Developers, data scientists, and data engineers can all use DPK to build scalable data preparation pipelines using frameworks such as Ray or Spark. These pipelines can be run in a low-code fashion by using the Kubeflow Pipelines (KFP) UI. DPK scales from running the data preparation modules on a laptop all the way up to a production-level cluster.

The following figure shows the detailed DPK architecture. In this article, we briefly describe only the most important components; more complete technical details of the DPK architecture are available in our DPK technical paper.

DPK architecture

Core components of DPK

The DPK architecture is composed of three fundamental components:

  1. Data access components
  2. Transform components
  3. Runtime components

Data access components

The Data Access components provide a unified interface for interacting with diverse data sources, including local file systems, Amazon S3 (Simple Storage Service)-compatible storage, and more.

These components identify and select target data sources, read and write data in supported formats (like Apache Parquet), and implement checkpointing for robust job recovery.

You can configure the Data Access components using command-line arguments, completely independent of individual transforms and runtime settings.
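To make this concrete, the following is a minimal sketch of what a local-filesystem data access layer might look like, written with pyarrow. The class and method names are our own illustration for this article, not the actual DPK API. It selects the target Parquet files, skips inputs whose output already exists (a simple form of checkpointing), and reads and writes tables.

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq


class LocalParquetAccess:
    """Illustrative local-filesystem data access layer (not the DPK API)."""

    def __init__(self, input_dir: str, output_dir: str):
        self.input_dir = Path(input_dir)
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def files_to_process(self) -> list[Path]:
        # Simple checkpointing: skip inputs whose output already exists.
        return [
            f for f in sorted(self.input_dir.glob("*.parquet"))
            if not (self.output_dir / f.name).exists()
        ]

    def read_table(self, path: Path) -> pa.Table:
        # Read one Parquet file into an in-memory Arrow table.
        return pq.read_table(path)

    def write_table(self, table: pa.Table, name: str) -> None:
        # Write the (possibly transformed) table back out as Parquet.
        pq.write_table(table, self.output_dir / name)
```

A cloud-backed implementation, such as one for S3-compatible storage, would expose the same read, write, and selection operations, which is what keeps the transforms independent of where the data lives.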

Transform components

The Transform components run specific data processing operations, such as conversion, de-duplication, and entity identification.

These components can support various data transformation patterns:

  • 1:1 – A single data object is transformed into a single transformed data object, such as annotating each row with a model score.
  • 1:N – A single data object is transformed into multiple data objects, such as splitting rows based on a given criterion.
  • N:1 – Multiple data objects are aggregated into a single object, such as combining multiple rows to a single row.
  • N:M – Any number of data objects are converted to any number of data objects, such as sorting data into data objects of a specific type.

These components provide a flexible framework for creating custom transforms while also offering built-in support for common operations like table transformations.

You can configure the Transform components using command-line arguments to tailor the behavior of the transforms to specific use cases.
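As an illustration of the 1:1 pattern, the following sketch annotates each row of a pyarrow Table with the length of its text column. The class name, method signature, and column names are simplified assumptions for this article, not the exact DPK transform contract.

```python
import pyarrow as pa
import pyarrow.compute as pc


class DocLengthTransform:
    """Illustrative 1:1 transform: annotate each row with its text length."""

    def __init__(self, text_column: str = "contents"):
        # Column name is a hypothetical default for this example.
        self.text_column = text_column

    def transform(self, table: pa.Table) -> tuple[list[pa.Table], dict]:
        # Compute the string length of every value in the text column.
        lengths = pc.utf8_length(table.column(self.text_column))
        # One table in, one annotated table out, plus metadata about the work.
        annotated = table.append_column("doc_length", lengths)
        return [annotated], {"rows_processed": table.num_rows}
```

A 1:N transform would instead return several tables from one input, and an N:1 or N:M transform would buffer inputs across calls before emitting its output.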

Runtime components

The Runtime components manage the execution environment for the data transforms, distributing the work across multiple workers while monitoring the progress of the transforms.

These components support various runtime environments, including Python, Ray, and Spark. They orchestrate the execution of the data transforms by assigning tasks to available workers, and they provide mechanisms for checkpointing and fault tolerance.

You can configure the Runtime components using command-line arguments to control the execution environment and resource allocation.
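The sketch below ties the illustrative classes from the earlier examples together in a plain Python "runtime." Command-line arguments configure the run (the flag names here are hypothetical, not DPK's actual options), and a loop applies the transform to each file. A distributed runtime such as Ray or Spark would execute the same loop across many workers.

```python
import argparse


def run_pipeline() -> None:
    # Command-line configuration, analogous in spirit to DPK's runtime
    # arguments; these flag names are illustrative only.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_dir", required=True)
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--text_column", default="contents")
    args = parser.parse_args()

    # Reuse the illustrative data access and transform classes defined above.
    data_access = LocalParquetAccess(args.input_dir, args.output_dir)
    transform = DocLengthTransform(args.text_column)

    # A distributed runtime (Ray, Spark) would farm this loop out to workers;
    # here the files are processed sequentially in a single Python process.
    for path in data_access.files_to_process():
        tables, metadata = transform.transform(data_access.read_table(path))
        # The 1:1 transform returns exactly one output table per input file.
        data_access.write_table(tables[0], path.name)
        print(f"{path.name}: {metadata}")


if __name__ == "__main__":
    run_pipeline()
```

Because the loop only skips files whose output already exists, rerunning the script after a failure resumes where it left off, which is the essence of the checkpointing behavior described above.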

Summary

Data Prep Kit is a scalable, flexible, robust, and easy-to-use framework for data processing. It is data agnostic, handling diverse data formats including text, code, and structured data. It uses distributed computing frameworks like Ray and Spark to efficiently process large data sets. Developers can create custom transforms (to supplement the built-in transforms) to readily address specific data processing needs. Lastly, DPK incorporates checkpointing and fault tolerance mechanisms to ensure reliable execution.