Summary

In this learning path, you got an overview of the Data Prep Kit. The learning path covered:

The fundamental concepts and features of Data Prep Kit (DPK) for building LLM applications
The practical aspects of data ingestion
How to extract data from various sources like PDFs, HTML, and code, and convert the data into tokens suitable for LLMs and vector databases
Ethical considerations for data preparation, and how transforms such as license filtering, hate, abuse, and profanity (HAP) detection, and PII redaction help users prepare data
How to build DPK transforms and integrate them into RAG and fine-tuning pipelines

Next steps

Explore the Data Prep Kit project in the data-prep-kit repo. If you find it empowers your work, join our growing community by giving us a star!

Want to learn more?

IBM’s newest launch, watsonx.data integration, leverages the power of Data Prep Kit (DPK) to simplify unstructured data ingestion, transformation, and processing, while bringing a scalable, repeatable, and easily maintainable approach to data pipelines. Paired with our hybrid lakehouse, watsonx.data, users can work across both structured and unstructured sources to build powerful, more accurate retrieval applications!
