Learning Path

Get started with Data Prep Kit (DPK)

Learn how to easily prepare unstructured data for building LLM applications via fine-tuning or RAG

Overview

In this learning path, learn how to use Data Prep Kit (DPK) to prepare data for large language model (LLM) applications.

Skill level

This learning path assumes basic Python skills as a prerequisite and uses Google Colab as the cloud-based Jupyter notebook environment.

Estimated time to complete

Approximately 2 hours.

Learning objectives

In this learning path, you learn:

  • The fundamental concepts and features of Data Prep Kit (DPK) for building LLM applications
  • The practical aspects of data ingestion
  • How to extract data from various sources like PDFs, HTML, and code, and convert the data into tokens suitable for LLMs and vector databases
  • Ethical considerations for data preparation, and how transforms like license filtering, hate, abuse, and profanity (HAP) detection, and personally identifiable information (PII) redaction help users prepare data
  • How to build DPK transforms and integrate them into RAG and fine-tuning pipelines

By completing this learning path, you'll be able to apply your knowledge and skills to real-world data preparation for LLM applications such as RAG and fine-tuning.