Tutorial
Tuning pre-trained LLMs using InstructLab
Generating synthetic data to enhance LLMs
The easiest way to contribute to a large language model (LLM) using InstructLab resembles a pull request in software development: you craft a skill.
This process requires two essential components: a qna.yaml file and a separate text file that provides attribution for the content. The YAML file contains structured data that organizes the information for the model; think of it as a plain text file with structured formatting.
InstructLab uses a selection of these skills to generate a more extensive set of synthetic data related to the provided examples.
However, there's a limit to the amount of content that the model can process effectively, so contributors should keep the question-and-answer pairs in the qna.yaml file under approximately 2,300 words. Adhering to this limit helps maintain the quality and efficiency of the model's training process.
Let's start with a simple example: teaching the model to summarize meetings. Using InstructLab, we'll generate synthetic data from a qna.yaml file of meeting transcript summaries, then train, serve, and evaluate the chat model.
Prerequisites
For the initial setup process, please follow the step-by-step instructions outlined in the InstructLab README.
Steps
Contribute skills.
First, create a qna.yaml file containing the meeting transcript summary data, including the minutes of the meeting, attendees, agenda, discussions, and action items. You can view my example qna.yaml file in my GitHub repo.
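If you're unsure what such a file looks like, here is a minimal sketch. The field names follow the InstructLab skills-taxonomy schema; the transcript, questions, and answers below are illustrative placeholders, not the actual contributed data.

```yaml
# Hypothetical sketch of a freeform skill qna.yaml for meeting summarization.
# The example content is illustrative, not the actual contributed data.
version: 2
task_description: Summarize a meeting transcript into minutes that list
  attendees, agenda, key discussion points, and action items.
created_by: your-github-username
seed_examples:
  - context: |
      Alice: Let's review the Q3 roadmap and the migration timeline.
      Bob: The database migration is on track for July.
    question: Summarize this meeting, listing attendees and action items.
    answer: |
      Attendees: Alice, Bob
      Agenda: Q3 roadmap review
      Discussion: The database migration is on track for July.
      Action items: Bob to confirm the July migration date.
```

Each seed example pairs a question (and optional context) with the answer you want the model to learn to produce; InstructLab uses these seeds to generate many more synthetic examples in the same style.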
Optionally, you can provide a separate text file that details the attribution of the content, such as who created it and where it came from.
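As a sketch, an attribution file is a short plain-text file describing the provenance of your examples. The field labels below reflect the general shape used in the taxonomy repository, with placeholder values:

```text
Title of work: Example meeting transcripts
Link to work: https://example.com/your-source
License of the work: CC-BY-4.0
Creator names: Your Name
```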
List and validate your data by running the `ilab diff` command to ensure your new skill is registered correctly in the taxonomy path:

```shell
ilab diff
```
Generate a synthetic dataset by running the `ilab generate` command, which produces synthetic data based on the newly added skill in the taxonomy repository:

```shell
ilab generate
```
This step may take anywhere from 15 minutes to over an hour, depending on your computing resources.
Train the model locally (on Linux or Mac M-series) by running the `ilab train` command:

```shell
ilab train
```
This step can take several hours to complete, depending on your computing resources. The trained model is written to the models directory.
Test the newly trained model by running the `ilab test` command to verify its performance:

```shell
ilab test
```
Serve the newly trained model. First, stop any existing server by pressing Ctrl+C. Then, convert the newly trained model with the `ilab convert` command and serve it:

```shell
ilab convert
# Serve the model locally
ilab serve --model-path <New model name>
```
Chat with the newly fine-tuned model using the chat interface by running this command:

```shell
ilab chat -m <New model name>
```
Submit your contribution!
If you’ve improved the model, open a pull request in the taxonomy repository to include the files with your improved data.
Following these steps will allow you to contribute the meeting transcript summary data to the model and train a new model based on it, enhancing its capabilities in generating meeting minutes.
Summary and next steps
In this tutorial, you learned how to use InstructLab to enhance a model's ability to generate meeting summaries. You created a skill from carefully crafted example data, generated synthetic training data from it, and then trained, served, and evaluated the model. The result is a specialized skill that distills meeting discussions into clear, concise summaries for improved comprehension and productivity.
To get started, join the InstructLab community on GitHub. You can also explore IBM foundation models in the IBM watsonx.ai studio that are designed to support knowledge and skills contributed by the open source community.
The following foundation models support community contributions from InstructLab:
- granite-7b-lab
- granite-13b-chat-v2
- granite-20b-multilingual
- merlinite-7b
Acknowledgements
This tutorial is adapted from my article on Medium, “InstructLab – Ever imagined the ease of tuning pre-trained LLMs? InstructLab makes it a reality. Let’s delve into how it sets itself apart from other model tuning methods.”