Tutorial
Train open source LLMs with collected knowledge with InstructLab
Create knowledge, generate synthetic data, and then train the model
On this page
InstructLab is a model-agnostic open source AI project that facilitates contributions to large language models (LLMs). It is a new, community-based approach to building truly open source LLMs. To learn more about InstructLab, read the article "What is InstructLab, and why do developers need it."
InstructLab uses a synthetic-data-based alignment tuning method to train LLMs. The InstructLab tuning method is driven by manually created taxonomies. InstructLab provides a process for optimizing and tuning LLMs by collecting knowledge and skills as part of a taxonomy tree.
In this tutorial, you learn how to train an open source LLM to know all about the movie Oppenheimer by creating a knowledge base and training the model on it in InstructLab. Knowledge consists of data and facts, supported by documents. The knowledge base that you create in this tutorial isn't currently eligible for contribution to the InstructLab taxonomy tree, but you can still try it out locally to understand how to train the model with a knowledge base.
Prerequisites
You need to install the InstructLab CLI (ilab). You can follow the instructions from my previous tutorial, or you can refer to the instructions in the InstructLab repo to install the ilab CLI.
Steps
Step 1. Download the base LLM
If you completed my other tutorial on creating a skill with InstructLab, then you can skip this step.
After you have installed the InstructLab CLI on your system, start by downloading the base model that you want to train. You can find the supported open source models on Hugging Face. The default is merlinite-7b-lab-Q4_K_M, the 4-bit quantized version of the model, which you use in this tutorial.
In the terminal, run the following command to initialize the ilab CLI:
ilab config init --non-interactive
To download the model, run this command:
ilab model download
To download a different base model, run the following command:
ilab model download --repository <huggingface_repo> --filename <filename>.gguf
Step 2. Create the required files
To train an open source model with InstructLab, you need to create a knowledge base in the taxonomy directory. When you initialized the ilab CLI, it automatically cloned the InstructLab taxonomy repository, which is the source of truth for your model training.
To create a knowledge base, you need to create a qna.yaml file and an attribution.txt file. Then, you need to create a public GitHub repository and upload all the knowledge files in Markdown (.md) format.
In the qna.yaml file, you must reference the supporting documents by specifying the repository, the SHA of the commit to your repository, and the glob patterns that match the Markdown files (such as *.md).
Here is the template for creating a knowledge qna.yaml file:
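The following is a sketch of the general shape of a knowledge qna.yaml file, based on the InstructLab taxonomy schema at the time of writing. Field names and required keys can change between schema versions, and the username, repository URL, and commit SHA below are placeholders that you must replace with your own values:

```yaml
# Sketch of a knowledge qna.yaml; exact fields may vary by taxonomy schema version.
created_by: <your_github_username>   # placeholder: your GitHub username
domain: movies
task_description: |
  Teach the model facts about the movie Oppenheimer.
seed_examples:
  - question: Who directed the movie Oppenheimer?
    answer: Christopher Nolan directed the movie Oppenheimer.
  - question: Who starred in the movie Oppenheimer?
    answer: Cillian Murphy starred as J. Robert Oppenheimer.
document:
  repo: https://github.com/<your_github_username>/<your_knowledge_repo>  # placeholder
  commit: <commit_SHA_of_your_knowledge_repo>                            # placeholder
  patterns:
    - "*.md"
```

The document section is what ties the knowledge base to your public GitHub repository of Markdown files.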
Here is the template for creating a knowledge attribution.txt file:
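The attribution.txt file credits the source of your knowledge documents. A minimal sketch, assuming the commonly used InstructLab attribution fields (check the taxonomy contribution guide for the exact required fields) and using Wikipedia as a placeholder source:

```text
Title of work: Oppenheimer (film)
Link to work: https://en.wikipedia.org/wiki/Oppenheimer_(film)
Revision: <revision ID or date accessed>
License of the work: CC-BY-SA-4.0
Creator names: Wikipedia contributors
```

Make sure the license you list actually matches the license of the documents in your repository.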
Create the qna.yaml and attribution.txt files in the taxonomy/knowledge/movies/oppenheimer/ directory of the cloned repo.
It is recommended to have 5 or more examples in the qna.yaml file for the knowledge base. Copy the qna.yaml and attribution.txt files from my GitHub repo into the taxonomy/knowledge/movies/oppenheimer/ directory of the cloned taxonomy repo.
Step 3. Serve the base model
Open two terminals and activate your ilab virtual environment in each. In the first terminal, run the following command to serve the model:
ilab model serve
If you want to serve a different model, run this command:
ilab model serve --model-path <modelpath>.gguf
You should see output similar to the following:
INFO 2024-05-30 17:24:41,256 lab.py:320 Using model 'models/merlinite-7b-lab-Q4_K_M.gguf' with -1 gpu-layers and 4096 max context size.
INFO 2024-05-30 17:24:41,659 server.py:206 Starting server process, press CTRL+C to shutdown server...
INFO 2024-05-30 17:24:41,659 server.py:207 After application startup complete see http://127.0.0.1:8000/docs for API.
Keep this terminal open; you use the served model to generate synthetic data, train the model, and test the model.
Step 4. Test the base model output by chatting with it
Now, in the second terminal, run the following command to chat with the base model and see whether it knows about the movie Oppenheimer:
ilab model chat -gm
An interactive shell will be presented where you can chat with the model. Go ahead and ask the model to tell you anything about the movie Oppenheimer, such as “Who starred in the movie Oppenheimer?” or “What are the release dates for Oppenheimer movie?”.
The model output without training for the “Who starred in the movie Oppenheimer?” query looks something like this:
The model output without training for the “What are the release dates for Oppenheimer movie?” query looks something like this:
As you can see from this output, the model's knowledge was last updated on May 17, 2022, so it doesn't know about more recent events. Next, you train the model with the Oppenheimer movie details and evaluate the results.
Step 5. Generate synthetic data
In the same second terminal, run the following command:
ilab data generate
To generate more than 100 samples, run the following command:
ilab data generate --num-instructions <int>
You can see the new synthetic data sets being generated in the output. If you are not satisfied with the generated data set, you can quit the process by pressing Ctrl+C, modify the examples in the qna.yaml file, and then rerun the generate command.
This process will take some time depending on your system. It took about 30 minutes on my M1 MacBook Pro. You can see the ETA in the output.
Once the synthetic data is generated, you see a summary of how many samples were generated and how many were discarded. Samples might be discarded by the critic model due to formatting or due to the ROUGE score threshold.
Example Output:
100 instructions generated, 12 discarded due to format (see generated/discarded_merlinite-7b-lab-Q4_K_M_2024-05-30T17_24_56.log), 1 discarded due to rouge score
A generated directory will also be created in the ilab directory, containing four files:
- Discarded data set (log file)
- Generated data set (json file)
- Train data set (jsonl file)
- Test data set (jsonl file)
Step 6. Train the model
Once the synthetic data is ready, all you have to do is run the following command in your terminal to train the model:
ilab model train
If you want to use a GPU for training, you can run the following command:
ilab model train --device 'cuda'
This process will take some time depending on your system and the number of iterations. It took approximately 30 minutes on my M1 MacBook Pro to complete 100 iterations. You can see the ETA in the output.
A new directory will be created in the ilab
directory, with a name similar to this: instructlab-merlinite-7b-lab
. This directory will have the new model weights and adapters.
Step 7. Test the model
InstructLab can also run basic tests to ensure model correctness. In your terminal, run the following command:
ilab model test
You can see output that shows the model's responses before and after training.
If you are training on a macOS computer, you need to quantize the model to run it on your Mac. In the terminal, run the following command:
ilab model convert
After you run the command, all the weights and adapters will be converted to a quantized GGUF model. A directory will be created in the ilab directory, with a name similar to this: instructlab-merlinite-7b-lab-trained.
Step 8. Serve and chat with the trained model
Go back to the first terminal where you served the base model, press Ctrl+C to stop the model serving, and then run the following command to serve the newly trained model:
ilab model serve --model-path instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf
In the second terminal, where you generated the synthetic data set, trained the model, and tested it, run the following command to chat with the model:
ilab model chat -gm -m instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf
You can now ask the model anything about the movie Oppenheimer and the model should be able to answer it!
The model output after training for the “Who starred in the movie Oppenheimer?” query looks something like this:
The model output after training for the “What are the release dates for Oppenheimer movie?” query looks something like this:
Summary and next steps
In this tutorial, you learned how to create a knowledge base for the movie Oppenheimer. After you set up the InstructLab CLI, you downloaded the base model and trained it using the qna.yaml file. Then, you tested your fine-tuned model and chatted with it.
To get started, join the InstructLab community on GitHub, create other knowledge bases, and contribute them to the InstructLab taxonomy tree by raising a pull request. You can also explore IBM foundation models from the IBM watsonx.ai studio that are designed to support knowledge and skills contributed by the open source community.
Acknowledgements
I thank my teammates Syeda Ameena Begum, Suman P, Suyash Gupte, Surya Deep Singh, Ravi Kumar Srirangam, Sumeet Kapoor and Amol Dhondse for encouraging me to write these InstructLab tutorials and review the content.