Tutorial

Tuning pre-trained LLMs using InstructLab

Generating synthetic data to enhance LLMs

To contribute to a large language model (LLM) using InstructLab, the easiest approach resembles a pull request in software development: you craft a skill.

This process requires two essential components: a qna.yaml file and a separate text file that provides attribution for the content. The qna.yaml file contains structured question-and-answer data; think of it as a plain text file with a simple, structured format that organizes the information for the model.
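For illustration, a minimal qna.yaml for a meeting-summary skill might look like the sketch below. The field names (task_description, created_by, seed_examples) follow the common InstructLab taxonomy layout but may differ by taxonomy version, so treat this as an assumption and check the taxonomy documentation for the current schema.

```yaml
# Hypothetical minimal skill file; the exact schema depends on
# your taxonomy version.
version: 2
task_description: Summarize a meeting transcript into concise minutes
created_by: your-github-username
seed_examples:
  - question: |
      Summarize this meeting transcript:
      Alice: Let's ship the beta on Friday.
      Bob: I'll finish the docs by Thursday.
    answer: |
      Attendees: Alice, Bob.
      Decisions: Ship the beta on Friday.
      Action items: Bob finishes the docs by Thursday.
```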

InstructLab uses a selection of these skills to generate a more extensive set of synthetic data related to the provided examples.

However, there’s a limit to the amount of content that the model can process effectively. As a result, contributors should ensure that the question and answer pairs in the qna.yaml file don’t exceed approximately 2300 words. By adhering to this limit, contributors can help maintain the quality and efficiency of the model’s training process.
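As a rough sanity check before contributing, you can count the words in your skill file with standard shell tools. This sketch creates a tiny stand-in file so it runs on its own; point the path at your real qna.yaml instead.

```shell
# Rough check that a skill file stays under the ~2300-word guideline.
# qna_example.yaml is a stand-in; use your own qna.yaml path.
printf 'question: Summarize this meeting\nanswer: The team agreed on next steps\n' > qna_example.yaml

# wc -w counts whitespace-separated words in the file.
words=$(wc -w < qna_example.yaml)
echo "word count: $words"
if [ "$words" -le 2300 ]; then
  echo "within the recommended limit"
fi
```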

Let’s start with a simple example. We’ll teach the model to produce meeting summaries by training it with InstructLab. Starting from a qna.yaml file of meeting transcript summary examples, we’ll generate synthetic data, then train, serve, and evaluate the chat model.

Prerequisites

For the initial setup process, please follow the step-by-step instructions outlined in the InstructLab README.

Steps

  1. Contribute skills.

    First, create a qna.yaml file containing the meeting transcript summary data, including the minutes of the meeting, attendees, agenda, discussions, and action items. You can view my example qna.yaml file in my GitHub repo.

    Optionally, you can provide a separate text file that details the attribution of the content, such as who created it and where it came from.

  2. Validate your data. Run the ilab diff command to list your new data and ensure it’s registered correctly in the taxonomy path.

     ilab diff
    
  3. Generate a synthetic dataset. Run the ilab generate command to produce a larger synthetic dataset based on the newly added skill in the taxonomy repository.

     ilab generate
    

    This step may take from 15 minutes to 1+ hours to complete, depending on your computing resources.

  4. Train the model locally (on Linux or a Mac with an M-series chip) by running the ilab train command.

    ilab train
    

    This step can take several hours to complete, depending on your computing resources. The trained model is written to the models directory.

  5. Test the newly trained model. Run the ilab test command to verify its performance.

     ilab test
    
  6. Serve the newly trained model. First, stop any existing server by pressing Ctrl+C. Then convert the newly trained model with the ilab convert command and serve it:

     ilab convert
     # Serve the model locally
     ilab serve --model-path <New model name>
    
  7. Chat with the newly fine-tuned model using the chat interface by running this command:

    ilab chat -m <New model name>
    
  8. Submit your contribution!

    If you’ve improved the model, open a pull request in the taxonomy repository to include the files with your improved data.

Following these steps will allow you to contribute the meeting transcript summary data to the model and train a new model based on it, enhancing its capabilities in generating meeting minutes.

Summary and next steps

In this tutorial, you learned how to use InstructLab to enhance a model's ability to generate meeting summaries. After carefully crafting the seed data, you generated synthetic data, then trained, served, and evaluated the model using InstructLab.

These steps create a specialized skill that helps the model distill meeting discussions into clear, concise summaries, improving comprehension and productivity.

To get started, join the InstructLab community on GitHub. You can also explore IBM foundation models in IBM watsonx.ai studio that are designed to support knowledge and skills contributed by the open source community.

The following foundation models support community contributions from InstructLab:

  • granite-7b-lab
  • granite-13b-chat-v2
  • granite-20b-multilingual
  • merlinite-7b

Acknowledgements

This tutorial is adapted from my article on Medium, “InstructLab – Ever imagined the ease of tuning pre-trained LLMs? InstructLab makes it a reality. Let’s delve into how it sets itself apart from other model tuning methods.”