
Key strategies for enhancing RAG effectiveness

RAG optimization techniques for providing AI models with high-quality data

By Shabna MT, Rakesh Polepeddi, Gourab Sarkar

Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM's internal representation of information.

Retrieval-augmented generation works in two steps:

  1. Retrieval step: When presented with a question or prompt, the RAG process retrieves a set of relevant documents or passages from a large corpus of documents.
  2. Generation step: The relevant passages that are retrieved are fed into a large language model along with the original query to generate a response.

For example, if the prompt was, “Can I provision a floating IP if I don't have a public gateway?” then the retrieval step would find documentation related to floating IPs and public gateways, and send them to the model along with the query. That way, the model can focus on understanding the question better and create a more relevant response.

In enterprise environments, RAG systems typically rely on external knowledge sources like product search engines or vector databases. When using vector databases, the process can be further split into the following tasks (a code sketch follows the list):

  1. Content segmentation. Breaking down large text documents into smaller, manageable chunks.
  2. Vectorization. Transforming these segments into numerical representations (vectors) suitable for machine learning algorithms.
  3. Vector database indexing. Storing these vectors in a specialized database optimized for similarity search.
  4. Retrieval and prompting. When generating responses, the system retrieves the most relevant segments from the vector database and uses them to construct prompts for the language model.
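
For illustration, here is a minimal sketch of these four tasks using Langchain with a FAISS vector store and a Hugging Face sentence-transformer embedder. The package names, embedding model, chunk sizes, and query are illustrative assumptions rather than part of the demo notebook referenced later in this article.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Assumed input: extracted_text holds the cleaned text produced by preprocessing
extracted_text = "..."

# 1. Content segmentation: split the text into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(extracted_text)

# 2. Vectorization and 3. vector database indexing: embed the chunks and index them
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_texts(chunks, embeddings)

# 4. Retrieval and prompting: fetch the most relevant chunks and build the prompt
query = "Can I provision a floating IP if I don't have a public gateway?"
relevant_chunks = vector_store.similarity_search(query, k=4)
context = "\n\n".join(doc.page_content for doc in relevant_chunks)
prompt = f"Answer using the context below.\n\n{context}\n\nQuestion: {query}"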

In this article, we explore various optimization techniques and discover practical strategies for taking your RAG implementations to the next level in terms of performance and impact.

Advanced RAG introduces a range of optimizations, both before and after the retrieval process, to enhance accuracy, efficiency, and relevance. Pre-retrieval optimizations involve techniques such as data preprocessing to reduce noise, efficient chunking strategies to maintain context, and advanced search capabilities like dense or hybrid retrieval. Post-retrieval optimizations involve techniques such as re-ranking the retrieval results, enhancing contextual relevance, and crafting effective prompts. Let's consider the transformations that make data AI-ready for efficient vectorization and retrieval.

Data preprocessing for AI-ready data

To prepare data for efficient use with AI, specifically for efficient vectorization and retrieval, there's no single best method. The ideal approach depends on the data type and file format. Often, combining techniques yields the optimal outcome. Therefore, selecting the right tools is crucial, considering the data's nature, the AI application's use case, and the required retrieval methods.

Let's consider the following file formats: PDF (.pdf), Microsoft Word (.doc and .docx), Microsoft PowerPoint (.ppt and .pptx), Markdown (.md), and plain text (.txt).

The top four challenges for the above-mentioned file types include:

  1. Inconsistent formatting. Files might have inconsistent line breaks, spaces, and tabs, which need to be normalized.
  2. Unstructured data. Files might contain unstructured data, requiring additional steps to extract meaningful information.
  3. Non-standardized content. Documents might have tables and different section layouts that need to be processed correctly.
  4. Document noise. Document noise refers to any irrelevant or extraneous information in a document, such as extra spaces, special characters, headers, footers, or formatting issues, that can hinder data processing and analysis.

The Environment Safety.pdf sample file demonstrates these challenges. Using the data in its current form is not likely to be optimal for semantic retrieval. Optimally extracted data looks like the environment-safety-extracted.txt plain text file. Data structured in this format and chunked into a vector database helps enhance semantic retrieval and accuracy.

Some of the most recent and popular open-source tools available for data preprocessing include: LibreOffice, pdfplumber, Tesseract, Poppler, Apache Tika, NLTK, libmagic, and Langchain Unstructured.

While many extraction tools are available, the optimal choice depends on your specific needs and document format. Leveraging our experience in recent AI projects, we have outlined a few data extraction techniques across various file formats, focusing on specific use cases.

The demo code that is referenced in this article is available in this GitHub repo as a reusable Jupyter Notebook.

Extracting data from PDF files

UnstructuredFileLoader from Langchain offers a simple yet powerful solution for diverse file format extraction. The following code snippet demos the usage of the UnstructuredFileLoader from Langchain.

from langchain_community.document_loaders import UnstructuredFileLoader
# Specify the path to your PDF file
pdf_file_path = 'file-path'
loader = UnstructuredFileLoader(pdf_file_path)
docs = loader.load()

UnstructuredFileLoader excels at handling straightforward PDF extraction tasks. However, because of limited OCR (Optical Character Recognition) capabilities, UnstructuredFileLoader is not a great choice when the PDF has complex data content such as images or tables and when table metadata extraction is required.

The pdfplumber library is an alternative for extracting data from PDFs that contain tables, images, and text. It provides an efficient way of extracting structured text, tables, and metadata from PDF files with precision. The following code snippet demos the usage of pdfplumber.

import pdfplumber
# Specify the path to your PDF file and open it
pdf_path = "file_path"
with pdfplumber.open(pdf_path) as pdf:
    # Iterate through each page
    for i, page in enumerate(pdf.pages):
        # Extract text from the page
        text = page.extract_text()
        print(text)
        # Extract tables from the page and print each row
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

Extracting data from text files

Extracting and transforming raw text files (.txt) into usable data is a multi-step process. The steps involve efficiently reading the contents of your text file and standardizing the text format, including lowercase conversion, punctuation removal, whitespace elimination, stop word removal, lemmatization, and handling of special characters. The following code snippet demonstrates sample extraction and transformation of raw text files.

import re
import chardet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# Requires the NLTK 'punkt', 'stopwords', and 'wordnet' data packages

## Read the file with its detected encoding
def detect_encoding(file_path):
    # chardet is used here to implement encoding detection; any detection approach works
    with open(file_path, 'rb') as f:
        return chardet.detect(f.read())['encoding']

def read_file(file_path):
    encoding = detect_encoding(file_path)
    with open(file_path, 'r', encoding=encoding) as f:
        return f.read()

file_path = "file_path"
text_content = read_file(file_path)

# Remove multiple spaces and tabs
text = re.sub(r'\s+', ' ', text_content)
# Remove punctuation and special characters except periods
text = re.sub(r'[^\w\s\.]', '', text)
# Convert to lowercase
text = text.lower()
# Strip leading and trailing whitespace
text = text.strip()

## Remove stop words
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
filtered_words = [word for word in words if word not in stop_words]

## Lemmatize the filtered words
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

Extracting data from Microsoft Word and Microsoft PowerPoint files

Apache Tika is a content analysis toolkit that can detect and extract metadata and text from various document types, including Microsoft Word and Microsoft PowerPoint files. The following code snippet demonstrates Microsoft Word Document and PowerPoint Presentation extraction using Tika:

#Import parser from tika
from tika import parser
file_path = "file_path"
parsed = parser.from_file(file_path)
if parsed and parsed["content"]:
    print(parsed["content"])

Extracting data from Markdown files

UnstructuredFileLoader from Langchain is also powerful for extracting data from Markdown files (.md). Langchain additionally provides a Markdown-specific loader, UnstructuredMarkdownLoader, for this file type.

The following code snippet demos the usage of the UnstructuredFileLoader from Langchain:

from langchain_community.document_loaders import UnstructuredFileLoader
md_file_path = 'file-path'
loader = UnstructuredFileLoader(md_file_path)
docs = loader.load()
print(docs)

Efficient chunking and context preservation

Efficient chunking is important for minimizing semantic drift and maintaining contextual relevance in retrieved information. Chunking involves breaking down large text documents into smaller, more manageable chunks.

Key considerations for efficient chunking:

  • Chunk size. The optimal chunk size depends on the specific use case, model, and embedder capabilities. Smaller chunks can improve retrieval speed but might sacrifice context, while larger chunks can retain more context but might impact performance.
  • Context preservation. It's essential to preserve the context within each chunk. Preserving context is highly data-source-specific, and various strategies can be employed based on the nature of the data, ranging from simple techniques like fixed-size and sentence-based chunking to more advanced ones like semantic chunking. A commonly used pattern is the sliding window technique, where a portion of the previous chunk is included in the current one to maintain continuity (see the sketch after the list of chunking strategies below).

Some of the most common chunking strategies include:

  • Overlapping chunks. Allow overlaps between chunks to preserve continuity across splits.
  • Semantic chunking. Use sentence or paragraph boundaries to create contextually meaningful chunks.
  • Fixed-size chunking. Split text based on a token or character count but fine-tune the chunks to avoid mid-sentence breaks.
  • Dynamic chunking. Adjust the chunk size dynamically based on semantic coherence or predefined priorities.
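
As a concrete example of sentence-based chunking with a sliding-window overlap, the following sketch (not part of the demo notebook) uses NLTK sentence tokenization; the chunk size and overlap values are illustrative and should be tuned for your embedder and use case.

from nltk.tokenize import sent_tokenize  # requires the NLTK 'punkt' data package

def chunk_text(text, max_chars=1000, overlap_sentences=2):
    # Sentence-based chunking with a sliding-window overlap between chunks
    sentences = sent_tokenize(text)
    chunks, current, size = [], [], 0
    for sentence in sentences:
        if size + len(sentence) > max_chars and current:
            chunks.append(" ".join(current))
            # Sliding window: carry the last few sentences into the next chunk
            current = current[-overlap_sentences:]
            size = sum(len(s) for s in current)
        current.append(sentence)
        size += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = chunk_text(text_content)  # text_content: cleaned text from preprocessing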

Enhancing search capabilities with hybrid retrieval

By combining the strengths of both vector search and keyword search, hybrid retrieval offers a more robust and accurate approach to information retrieval. When dealing with highly specialized domains and queries, not all retrieved information is directly relevant; often only around 60% of the retrieved chunks align with the query's specific terminology. Hybrid retrieval is typically useful for large and diverse data sets, specialized domain data sets, ambiguous user queries, handling edge cases in search, and improving overall search accuracy.

The following steps show how a simple hybrid retrieval is done (a code sketch follows the list):

  1. Keyword extraction. Identify key terms from the user query using either KeyBERT or KeyLLM.
  2. Keyword search. Conduct a keyword search on the product search engine or the vector database data chunks using the extracted keywords.
  3. Semantic similarity search. Perform a similarity search on the results from the keyword search to identify semantically related content.
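
The following minimal sketch illustrates these steps. It assumes KeyBERT is installed and reuses the chunks and embeddings variables from the earlier pipeline sketch; the substring-based keyword search is a simple stand-in for a product search engine or the vector database's own keyword or scalar filtering.

from keybert import KeyBERT
from langchain_community.vectorstores import FAISS

# Step 1: Keyword extraction from the user query
kw_model = KeyBERT()
query = "Can I provision a floating IP if I don't have a public gateway?"
keywords = [kw for kw, _ in kw_model.extract_keywords(
    query, keyphrase_ngram_range=(1, 2), top_n=5)]

# Step 2: Keyword search (illustrative substring match over the chunk texts)
keyword_hits = [c for c in chunks if any(kw.lower() in c.lower() for kw in keywords)]
keyword_hits = keyword_hits or chunks  # fall back to all chunks if nothing matches

# Step 3: Semantic similarity search restricted to the keyword hits
candidate_store = FAISS.from_texts(keyword_hits, embeddings)
results = candidate_store.similarity_search(query, k=4)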

Effective query phrasing

By crafting clear and specific queries, we can significantly improve the quality and relevance of the retrieval results and thereby generated responses. Key strategies for effective query phrasing include avoiding ambiguous queries and providing enough context to narrow down the search scope.

To guide users towards effective query phrasing, some of the query recommendation techniques that can be adopted are:

  1. Provide a list of pre-defined questions that the system is designed to answer.
  2. Provide suggestions for expanding acronyms and abbreviations within the query. These suggestions can be sourced from a pre-defined list tailored to the specific domain of the data source (see the sketch after this list).
  3. Recommend queries that have been frequently used or highly upvoted by other users.
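
As a simple illustration of the second technique, the following sketch expands acronyms in a query using a pre-defined, domain-specific mapping; the acronym list shown here is hypothetical.

import re

# Hypothetical domain-specific acronym map; in practice this is curated per data source
ACRONYMS = {"vpc": "virtual private cloud", "fip": "floating IP", "k8s": "Kubernetes"}

def suggest_expanded_query(query):
    # Suggest a version of the query with known acronyms expanded
    def expand(match):
        word = match.group(0)
        return ACRONYMS.get(word.lower(), word)
    suggestion = re.sub(r"\b\w+\b", expand, query)
    return suggestion if suggestion != query else None

print(suggest_expanded_query("Can I attach a FIP to my VPC?"))
# Suggests: "Can I attach a floating IP to my virtual private cloud?"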

Independent inferencing

When the knowledge source that comprises your RAG system is heterogeneous and search spans various data sources, it's challenging to maintain high-quality results, especially when the same information appears in different formats and contexts (for example, FAQ documents, conversational systems, and ticketing systems).

A recommended approach is to independently search each data source and provide answers with links to their original sources. This method offers several benefits:

  • Improved accuracy. Reduces the risk of hallucinations or incorrect answers that can arise when data from diverse sources is combined.
  • Enhanced relevance. Delivers more relevant answers, even for queries with limited or no direct matches, because the retrieval is narrower and source-specific.
  • Contextual understanding. Allows for a better understanding of the information, considering the context of each source.

While this approach can lead to more distributed answers, it also presents potential challenges:

  • Contradictions. Different sources may provide conflicting information.
  • Redundancy. Similar information may appear in multiple sources.

Prioritization of documents

When dealing with a large and diverse data set, it's important to make sure that crucial documents like FAQs and user guides are easily found. By tagging and storing these important documents separately in a system like a vector database (for example, Milvus or Elasticsearch), we can improve search accuracy and help users quickly find what they're looking for. Here's how we can achieve this using Milvus and Elasticsearch.

Milvus: Multi-vector search

The following steps show how you can use Milvus Vector DB for prioritized document search.

  1. Create separate partitions:

    • Store highly important documents in a dedicated partition.
    • Store other documents in a common partition.
  2. Search both partitions:

    • Use Milvus to search both partitions simultaneously. AnnSearchRequest is used for this.
    • Retrieve the top x results from the priority partition and the top y results from the common partition.
  3. Combine and rank results:

    • Merge the results from both partitions.
    • Use a ranking algorithm like reciprocal rank fusion (RRF) or Weighted Ranker to optimize result relevance.
  4. Use the top N most relevant results that are returned for response generation (see the sketch below).

Figure: Prioritized document search with Milvus partitions
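
The following minimal pymilvus sketch illustrates this flow. It assumes a collection named documents with a vector field embedding, partitions priority_docs and common_docs, and a precomputed query_embedding. For simplicity, it searches each partition with collection.search and merges the results client-side with reciprocal rank fusion; Milvus's AnnSearchRequest with its built-in RRF or weighted rankers can serve the same purpose.

from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")  # assumed local Milvus instance
collection = Collection("documents")
search_params = {"metric_type": "COSINE", "params": {}}

# Top x results from the priority partition and top y results from the common partition
priority_hits = collection.search(
    data=[query_embedding], anns_field="embedding", param=search_params,
    limit=5, partition_names=["priority_docs"], output_fields=["text"])[0]
common_hits = collection.search(
    data=[query_embedding], anns_field="embedding", param=search_params,
    limit=20, partition_names=["common_docs"], output_fields=["text"])[0]

def rrf_merge(result_lists, k=60, top_n=10):
    # Client-side reciprocal rank fusion over several ranked hit lists
    scores = {}
    for hits in result_lists:
        for rank, hit in enumerate(hits):
            scores[hit.id] = scores.get(hit.id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

top_docs = rrf_merge([priority_hits, common_hits])  # top N (id, score) pairs for response generation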

Elastic: Custom implementation

The following steps show how you can use Elasticsearch for prioritized document search.

  1. Create a separate index:

    • Create a prioritized index for highly important documents.
    • Create a common index for other documents.
  2. Perform a keyword search on both indexes:

    • Retrieve top x (say 100) results from the common index.
    • Retrieve top x (say 10) results from the prioritized index.
  3. Perform a semantic search on both indexes:

    • Retrieve top x (say 10) results from the common index.
    • Retrieve top x (say 3) results from the prioritized index.
  4. Combine and rank results:

    • Merge the results from both indexes.
    • Use a ranking algorithm like Reciprocal Rank Fusion (RRF) or Weighted Ranker to optimize result relevance.
  5. Refine the search with BM25:

    • The top 10 most relevant results that are returned are further refined with BM25 and used for response generation (see the sketch below).

Figure: Prioritized document search with Elasticsearch indexes
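
The following minimal sketch outlines the keyword and semantic searches with the Elasticsearch Python client. The cluster URL, index names, field names, and query_embedding are assumptions for illustration; the collected hits can then be merged and re-ranked (for example, with the rrf_merge helper from the Milvus sketch) before response generation.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
query_text = "Can I provision a floating IP if I don't have a public gateway?"

# Keyword (BM25) search on both indexes
common_kw = es.search(index="common_docs", query={"match": {"content": query_text}}, size=100)
priority_kw = es.search(index="priority_docs", query={"match": {"content": query_text}}, size=10)

# Semantic (kNN) search on both indexes, assuming an "embedding" dense_vector field
common_sem = es.search(index="common_docs", knn={
    "field": "embedding", "query_vector": query_embedding, "k": 10, "num_candidates": 100})
priority_sem = es.search(index="priority_docs", knn={
    "field": "embedding", "query_vector": query_embedding, "k": 3, "num_candidates": 50})

# Collect the hits from all four searches for rank fusion and response generation
all_hits = [r["hits"]["hits"] for r in (common_kw, priority_kw, common_sem, priority_sem)]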

Summary

Ensuring the success of RAG systems requires careful consideration and testing. A key step is transforming diverse data into a suitable AI-friendly format. By selecting the right tools and prioritizing clean, structured extraction, we can provide AI models with high-quality data. RAG optimization techniques can be tailored to the specific knowledge source and user behavior. Combining multiple approaches often yields the best results.