
Retrieval augmented generation with large language models from watsonx.ai

Learn evaluation and chunking techniques by using the Llamaindex library

By Ravi Kumar Srirangam

Retrieval augmented generation (RAG) is a technique for improving the retrieval accuracy and the quality of large language model (LLM)-generated responses by grounding them in data fetched from external sources. RAG is sometimes described as a form of in-context learning. This article starts with the implementation of a RAG pattern that uses the flan-ul2 model of watsonx.ai. It continues with the evaluation of the RAG solution and an example of a smart chunking technique. The data describes a fictitious company's policies and is synthetically generated. An in-memory Chroma DB instance is used as the vector database.

The RAG scenario itself is simple and introductory; the focus is on chunking techniques and on evaluating the RAG pipeline. The data and the example are illustrative, and the idea is to explore various chunking techniques and evaluate their performance to determine the technique that works best for a given use case. The GitHub repository contains the notebooks with detailed explanations of the steps and code. The notebooks were developed and tested on the IBM watsonx.ai AI studio, and you are encouraged to use it to execute them.

The example solution covers the implementation of a RAG approach for company policies. The solution is then evaluated with the Llamaindex library's evaluation metrics to determine the optimal chunk size for the use case. Finally, the article looks at smart chunking techniques with a focus on document segmentation.

To follow along with this article, you should have the following knowledge and access:

  1. Working knowledge of Python and Jupyter notebooks
  2. Knowledge of LLMs
  3. Access to IBM Cloud and watsonx.ai Enterprise Studio
  4. Basic understanding of RAG, Langchain, and Llamaindex

GitHub repository with data and notebooks

The associated GitHub repository contains the data and notebooks used in this article. The company policies Microsoft® Word document in the data folder contains the various policies that were created for the company. The same data is available in the companypolicies.txt file, which is the file used in the code to make the data handling simple. Some policies are put in individual text files for the smart chunking approach notebook.

There are three notebooks in the repository, and each notebook can be executed independently. You can start with the simple-rag.ipynb notebook to understand how the RAG approach has been implemented using the watsonx.ai flan-ul2 model, Langchain, and ChromaDB.

Then, you can run the rag-evaluation.ipynb evaluation notebook. This notebook covers the evaluation of the solution using the Llamaindex library and is a little more technically advanced. However, step-by-step instructions are provided to make it easier to understand.

The third notebook, smart-chunking.ipynb, explores smart chunking by segmenting the documents based on their subtitles. The document segmentation approaches are discussed later in this article. You are encouraged to read the entire article before executing the notebooks. It is also recommended that you use watsonx.ai Enterprise Studio to execute the notebooks.

Solution setup

At a high level, to run the solution:

  1. Clone the GitHub repository.
  2. Generate an IAM key on IBM Cloud if you haven’t already done this.
  3. Create Cloud Object Storage and Watson Machine Learning services.
  4. Create a project in the watsonx.ai Enterprise Studio.
  5. Associate a Watson Machine Learning service with the project.
  6. Upload the notebooks under the Assets tab of the project. This can be done by clicking New Task, selecting the tile with the name Working with data and models with Python and R notebooks, and clicking Local File.

Retrieval augmented generation scenario with flan-ul2 model

Flan-ul2 is an encoder-decoder model that is based on the T5 architecture and has 20 billion parameters. To begin looking at the scenario, I'll start with the question answering solution. The notebook has step-by-step details and is fairly self-explanatory. Some details regarding model creation and splitting the document into chunks are elaborated on in this article. You should read the entire section to understand the concepts before executing the notebook. Note: Keep the Cloud IAM key ready before execution.

The document is split into chunks of 1,000 characters with an overlap of 0. This means that each chunk is discrete and does not share any content with the next chunk. These values are an arbitrary starting point. (The approach to determine the optimal chunk size is covered in the evaluation section.) The logs of this cell might show chunk sizes that are not exactly 1,000, which is expected behavior because CharacterTextSplitter splits on a separator before enforcing the size. The following code example shows the code for chunking.

CharacterTextSplitter parameter
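The chunking code in the original article appears as an image. A minimal sketch of the same step with LangChain's CharacterTextSplitter might look like the following; the file path and variable names are illustrative and might differ from the notebook.

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# Load the synthetic company policies text file (path is illustrative).
loader = TextLoader("companypolicies.txt")
documents = loader.load()

# Split into chunks of roughly 1,000 characters with no overlap between chunks.
# CharacterTextSplitter splits on a separator first, so actual chunk lengths can vary.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

print(f"Number of chunks: {len(texts)}")
```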

Model creation has a sequence of steps, as shown in the following code example, and is wrapped in the WatsonxLLM class.

Model creation
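The model creation code also appears as an image in the original article. A sketch of the typical pattern with the ibm_watson_machine_learning library is shown below; the credential placeholders and generation parameters are illustrative, and the notebook's exact values might differ.

```python
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams

# Placeholders: supply your own IAM key and watsonx.ai project ID.
credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": "YOUR_IBM_CLOUD_IAM_KEY",
}
project_id = "YOUR_WATSONX_PROJECT_ID"

# Illustrative generation parameters.
params = {
    GenParams.DECODING_METHOD: "greedy",
    GenParams.MIN_NEW_TOKENS: 1,
    GenParams.MAX_NEW_TOKENS: 300,
}

# Create the flan-ul2 foundation model and wrap it so LangChain can use it as an LLM.
flan_ul2 = Model(
    model_id="google/flan-ul2",
    params=params,
    credentials=credentials,
    project_id=project_id,
)
llm = WatsonxLLM(model=flan_ul2)
```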

Queries can be executed using the RetrievalQA class of Langchain. Try changing the query to target various policies and observe the output. The appropriate chunk is retrieved from ChromaDB and sent to the flan-ul2 model to generate the output. The notebook also provides code to run similarity search queries directly on ChromaDB, so you can run the same query in both cells and compare the retrieved chunk with the summary that the flan-ul2 model generates.
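A sketch of this retrieval step is shown below, assuming the `texts` chunks and the `llm` wrapper from the earlier sketches; the embedding model and the sample query are illustrative choices, not necessarily what the notebook uses.

```python
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Build an in-memory Chroma vector store over the chunks created earlier.
embeddings = HuggingFaceEmbeddings()  # defaults to a sentence-transformers model
docsearch = Chroma.from_documents(texts, embeddings)

# Wire the watsonx.ai LLM and the retriever into a question answering chain.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)

query = "What is the company policy on mobile phone usage?"
print(qa.run(query))
```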

Retrieval evaluation

After implementing the simple RAG approach for the data, it’s time to evaluate the accuracy of the information retrieval logic. The suggested approach is to build a ground truth data set and use it as a reference against which to compare the retrieved answers. There are several metrics and frameworks, such as ROUGE and Ragas, for this purpose.

In the second notebook, Llamaindex is used to evaluate the RAG pipeline. The questions for the ground truth are generated by using the DatasetGenerator of Llamaindex. The notebook uses a sentence-transformers model as the embedding model. The evaluate function computes the average response time, faithfulness, and relevancy metrics for each chunk size, and you can analyze the results to determine the optimal chunk size for the given data set. Chunk sizes of 256, 512, and 1000 were considered for this data set.
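A minimal sketch of the question generation step is shown below. The import paths follow the pre-0.10 Llamaindex package layout, `llm` is assumed to be the watsonx.ai LLM wrapper from the earlier sketch, and the number of questions kept is illustrative.

```python
from llama_index import ServiceContext, SimpleDirectoryReader
from llama_index.evaluation import DatasetGenerator

# Load the policies text as Llamaindex documents.
documents = SimpleDirectoryReader(input_files=["companypolicies.txt"]).load_data()

# DatasetGenerator defaults to OpenAI models; point it at the watsonx.ai LLM instead
# and use a local embedding model so that no OpenAI key is needed.
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
data_generator = DatasetGenerator.from_documents(documents, service_context=service_context)

# Generate candidate ground-truth questions from the document nodes and keep a sample.
eval_questions = data_generator.generate_questions_from_nodes()[:20]
print(eval_questions[:5])
```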

You can now run the rag-evaluation.ipynb notebook. Pay attention to the embedding model creation and see how it is used in ServiceContext to override the default OpenAI models. The notebook contains details and instructions to guide you during its execution.
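The sketch below outlines one way to compute the three metrics per chunk size, assuming the `documents`, `eval_questions`, and `llm` objects from the previous sketches and the pre-0.10 Llamaindex APIs (ServiceContext, FaithfulnessEvaluator, RelevancyEvaluator); the notebook's implementation might differ in detail.

```python
import time

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# Sentence-transformers embedding model that overrides the default OpenAI embeddings.
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

def evaluate_chunk_size(chunk_size, documents, eval_questions, llm):
    """Return average response time, faithfulness, and relevancy for one chunk size."""
    service_context = ServiceContext.from_defaults(
        llm=llm, embed_model=embed_model, chunk_size=chunk_size
    )
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    query_engine = index.as_query_engine()

    faithfulness = FaithfulnessEvaluator(service_context=service_context)
    relevancy = RelevancyEvaluator(service_context=service_context)

    total_time, faithful, relevant = 0.0, 0, 0
    for question in eval_questions:
        start = time.time()
        response = query_engine.query(question)
        total_time += time.time() - start
        faithful += int(faithfulness.evaluate_response(response=response).passing)
        relevant += int(relevancy.evaluate_response(query=question, response=response).passing)

    n = len(eval_questions)
    return total_time / n, faithful / n, relevant / n

# Compare the candidate chunk sizes used in the article.
for chunk_size in (256, 512, 1000):
    print(chunk_size, evaluate_chunk_size(chunk_size, documents, eval_questions, llm))
```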

Chunking techniques

When working with LLMs, the input data often needs to be extracted from multiple document formats. This is challenging in itself, and the fact that the optimal chunk size can vary across data sets makes it even more difficult. The output that an LLM generates is more likely to match expectations when the retrieved input is contextually relevant.

This section focuses on segmenting the documents and splitting them into chunks by their logical layout structure. Watson Discovery supports various document formats, and the Smart Document Understanding (SDU) feature helps in splitting the document by using the logical elements of the data. The company’s policies document is split into multiple documents by using the subtitle of each policy. Therefore, each policy is a separate document and is ingested individually in the vector store. This approach leverages the standard capabilities of Watson Discovery and supports multiple document formats.

The company's policies are ingested into Watson Discovery as a Microsoft Word document, as shown in the following image. SDU supports various document elements such as title, text, and questions, as shown on the right in the image. The structure of the document can be described visually using the available options; each page must be annotated and submitted. For further instructions, see the Watson Discovery documentation.

SDU pages

After Watson Discovery has segmented the document, you can use its query language to fetch the individual segments, and a simple Python script can write each one to its own file. The following image shows a sample query, and a sketch of such a script follows the image.

Sample query
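A minimal sketch of such a script, using the ibm-watson Python SDK's DiscoveryV2 client, is shown below. The service URL, project ID, result field names, and output file naming are all illustrative assumptions; adjust them to match your Discovery instance and collection.

```python
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import DiscoveryV2

# Placeholders: supply your own IAM key, Discovery service URL, and project ID.
authenticator = IAMAuthenticator("YOUR_IBM_CLOUD_IAM_KEY")
discovery = DiscoveryV2(version="2020-08-30", authenticator=authenticator)
discovery.set_service_url("YOUR_DISCOVERY_SERVICE_URL")

# Fetch the segmented policy documents from the Discovery project.
response = discovery.query(project_id="YOUR_DISCOVERY_PROJECT_ID", count=50).get_result()

# Write each returned segment to its own text file. The "text" field is commonly a
# list of strings in Discovery query results, but this depends on the collection setup.
for i, result in enumerate(response["results"]):
    text = result.get("text", "")
    if isinstance(text, list):
        text = "\n".join(text)
    with open(f"policy_{i}.txt", "w") as f:
        f.write(text)
```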

The individual policy files have already been created and are available in the GitHub repository, and they are used by the third notebook, smart-chunking.ipynb, which you can now execute.
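As a rough illustration of the ingestion step, the sketch below loads each policy file as its own document and indexes it in Chroma; the directory path and embedding choice are assumptions, and the notebook may implement this differently.

```python
from pathlib import Path

from langchain.document_loaders import TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Load every individual policy file as its own document, so each retrieved unit is a
# complete policy rather than an arbitrary fixed-size slice of the combined file.
policy_docs = []
for path in sorted(Path("data/policies").glob("*.txt")):  # directory name is illustrative
    policy_docs.extend(TextLoader(str(path)).load())

embeddings = HuggingFaceEmbeddings()
policy_store = Chroma.from_documents(policy_docs, embeddings)

# Retrieve the single most relevant policy for a query.
retriever = policy_store.as_retriever(search_kwargs={"k": 1})
```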

Explorations are underway to segment documents into contextually relevant chunks instead of relying on naïve chunking approaches. Today, multiple frameworks and libraries must be combined to support different document formats, and there is currently no single solution that supports multiple document formats and facilitates customized chunking.

Summary

The article covered the implementation of a RAG solution for a fictitious company’s policies data and the details of the information retrieval evaluation. It also explained segmenting the document based on the logical structure and the implementation using Watson Discovery’s feature. The data and notebooks are available in the GitHub repository and can be executed in watsonx.ai Enterprise Studio.
