Tutorial

Train open source LLMs on collected knowledge with InstructLab

Create knowledge, generate synthetic data, and then train the model

InstructLab is a model-agnostic open source AI project that facilitates contributions to large language models (LLMs). It is a new community-based approach to building truly open source LLMs. To learn more about InstructLab, read the article "What is InstructLab, and why do developers need it?"

InstructLab uses a synthetic-data-based alignment tuning method to train LLMs, driven by manually created taxonomies. It provides a process for optimizing and tuning LLMs by collecting knowledge and skills as part of a taxonomy tree.

In this tutorial, you learn how to train an open source LLM to know all about the movie Oppenheimer by creating a knowledge base and training the model on it with InstructLab. Knowledge consists of data and facts, which are supported by documents. The knowledge base that you create in this tutorial isn't currently accepted as a contribution to the InstructLab taxonomy tree, but you can still try it out locally to understand how to train the model with a knowledge base.

Prerequisites

You need to install the InstructLab CLI (ilab). You can follow the instructions in my previous tutorial, or refer to the instructions in the InstructLab repo to install the ilab CLI.

Steps

Step 1. Download the base LLM

If you completed my other tutorial on creating a skill with InstructLab, then you can skip this step.

After you have installed the InstructLab CLI on your system, start by downloading the base model that you want to train. You can find the supported open source models on Hugging Face. The default is merlinite-7b-lab-Q4_K_M.gguf, the 4-bit quantized version of merlinite-7b-lab, which is the version you use in this tutorial.

In the terminal, run the following command to initialize the ilab CLI:

ilab config init --non-interactive

To download the model, run this command:

ilab model download

To download a different base model, run the following command:

ilab model download --repository <huggingface_repo> --filename <filename>.gguf
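
For example, to download the 4-bit quantized granite-7b-lab model instead, the command would look something like this (the repository and file names here are illustrative; check the model's Hugging Face page for the exact values):

ilab model download --repository instructlab/granite-7b-lab-GGUF --filename granite-7b-lab-Q4_K_M.gguf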

Step 2. Create the required files

To train an open source model with InstructLab, you need to create a knowledge base in the taxonomy directory. When you initialized the ilab CLI, it automatically cloned the InstructLab taxonomy repository, which is the source of truth for your model training.

To create a knowledge base, you need to create a qna.yaml file and an attribution.txt file. Then, you need to create a public GitHub repo and upload all the knowledge files in Markdown (.md) format.

In the qna.yaml file, you must reference the supporting documents by specifying the repo, the SHA of the commit to your repo, and the glob patterns that select the Markdown files (such as *.md).

Here is the template for creating a knowledge qna.yaml file:

Knowledge qna.yaml file
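
Here is a minimal sketch of what that file can look like for the Oppenheimer knowledge base. The field values are illustrative, and the schema changes between InstructLab releases, so check the taxonomy repo's contribution docs for the current format:

created_by: <your_github_username>
domain: movies
task_description: Teach the model facts about the movie Oppenheimer
seed_examples:
  - question: Who directed the movie Oppenheimer?
    answer: Christopher Nolan wrote and directed the movie Oppenheimer.
  - question: Who starred in the movie Oppenheimer?
    answer: Cillian Murphy starred as J. Robert Oppenheimer.
  # Add more pairs; five or more seed examples are recommended.
document:
  repo: https://github.com/<your_username>/<your_knowledge_repo>
  commit: <SHA of the commit that added the Markdown files>
  patterns:
    - "*.md"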

Here is the template for creating a knowledge attribution.txt file:

Knowledge attribution.txt file
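
Here is a sketch of the attribution.txt file, which records where the supporting documents came from (the values shown are illustrative):

Title of work: Oppenheimer (film)
Link to work: https://en.wikipedia.org/wiki/Oppenheimer_(film)
Revision: <link or date of the specific revision you used>
License of the work: CC-BY-SA-4.0
Creator names: Wikipedia contributors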

Create the qna.yaml and attribution.txt files in the taxonomy/knowledge/movies/oppenheimer/ directory of the cloned repo.

It is recommended to have five or more examples in the qna.yaml file for the knowledge base. You can copy the qna.yaml and attribution.txt files from my GitHub repo into the taxonomy/knowledge/movies/oppenheimer/ directory of the cloned taxonomy repo.

Step 3. Serve the base model

Open two terminals and activate your ilab Python virtual environment in each.
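
Assuming you created the virtual environment in a directory named venv, as in the installation instructions, the activation command in each terminal looks like this:

source venv/bin/activate

In the first terminal, run the following to serve the model: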

ilab model serve

If you want to serve a different model, run this command:

ilab model serve --model-path <modelpath>.gguf
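
For example, if you downloaded the granite model shown earlier, the serve command would look something like this (the path is illustrative; use the path where the model was actually downloaded):

ilab model serve --model-path models/granite-7b-lab-Q4_K_M.gguf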

You should see output similar to the following:

INFO 2024-05-30 17:24:41,256 lab.py:320 Using model 'models/merlinite-7b-lab-Q4_K_M.gguf' with -1 gpu-layers and 4096 max context size.
INFO 2024-05-30 17:24:41,659 server.py:206 Starting server process, press CTRL+C to shutdown server...
INFO 2024-05-30 17:24:41,659 server.py:207 After application startup complete see http://127.0.0.1:8000/docs for API.

Keep this terminal open so that the model stays served while you generate synthetic data, train the model, and test the model in the other terminal.

Step 4. Test the base model output by chatting with it

Now, in the second terminal, run the following command to chat with the base model and see whether the model knows about the movie Oppenheimer:

ilab model chat -gm

An interactive shell opens where you can chat with the model. Go ahead and ask the model anything about the movie Oppenheimer, such as “Who starred in the movie Oppenheimer?” or “What are the release dates for the Oppenheimer movie?”.

The model output without training for the “Who starred in the movie Oppenheimer?” query looks something like this:

Model output without training

The model output without training for the “What are the release dates for Oppenheimer movie?” query looks something like this:

Model output without training

As you can see from this output, the model's knowledge was last updated on May 17, 2022, and it doesn't have knowledge about newer events. Next, you will train the model with the Oppenheimer movie details and evaluate the results.

Step 5. Generate synthetic data

In the same second terminal, run the following command:

ilab data generate

By default, 100 samples are generated. To generate a different number of samples, run the following command:

ilab data generate --num-instructions <int>
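
For example, to generate 200 samples:

ilab data generate --num-instructions 200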

You can see the new synthetic data sets being generated in the output. If you are not satisfied with the generated data set, you can quit the process by pressing Ctrl+C. Modify the examples in the qna.yaml file and then rerun the generate command.

This process will take some time depending upon your system. It took about 30 minutes on my M1 MacBook Pro. You can see the ETA in the output.

Once the synthetic data is generated, you will see a summary of how many samples have been generated and how many have been discarded. Samples might be discarded by the critic model for formatting issues or because of the ROUGE score threshold.

Example Output:

100 instructions generated, 12 discarded due to format (see generated/discarded_merlinite-7b-lab-Q4_K_M_2024-05-30T17_24_56.log), 1 discarded due to rouge score

A generated directory is also created in the ilab directory. It contains four files (see the example listing after this list):

  • Discarded data set (.log file)
  • Generated data set (.json file)
  • Train data set (.jsonl file)
  • Test data set (.jsonl file)
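
Extrapolating from the timestamped log file name in the example output above, the contents of the generated directory look something like this (the exact names depend on your model and run time):

generated/
  discarded_merlinite-7b-lab-Q4_K_M_2024-05-30T17_24_56.log
  generated_merlinite-7b-lab-Q4_K_M_2024-05-30T17_24_56.json
  train_merlinite-7b-lab-Q4_K_M_2024-05-30T17_24_56.jsonl
  test_merlinite-7b-lab-Q4_K_M_2024-05-30T17_24_56.jsonl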

Step 6. Train the model

Once the synthetic data is ready, all you have to do is run the following command in your terminal to train the model:

ilab model train

If you want to use a GPU for training, you can run the following command:

ilab model train --device 'cuda'
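
The number of training iterations defaults to 100. To change it, the CLI exposes an --iters flag (flags vary between releases, so verify with ilab model train --help):

ilab model train --iters 300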

This process will take some time depending upon your system and the number of iterations. It took approximately 30 minutes on my M1 MacBook Pro to complete 100 iterations. You can see the ETA in the output.

A new directory will be created in the ilab directory, with a name similar to this: instructlab-merlinite-7b-lab. This directory will have the new model weights and adapters.

Step 7. Test the model

InstructLab can also run basic tests to ensure model correctness. In your terminal, run the following:

ilab model test

The output shows the model's responses before and after training.

If you are training on a macOS computer, you need to quantize the model to run it on your Mac. In the terminal, run the following:

ilab model convert

All the weights and adapters are converted to a quantized GGUF model after running the command. A directory will be created in the ilab directory, with a name similar to this: instructlab-merlinite-7b-lab-trained.

Step 8. Serve and chat with the trained model

Go back to the first terminal where you served the base model and press Ctrl+C to stop the model server. Run the following command to serve the newly trained model:

ilab model serve --model-path instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf

In the second terminal, where you generated the synthetic data set and trained and tested the model, run the following command to chat with the model:

ilab model chat -gm -m instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf

You can now ask the model anything about the movie Oppenheimer and the model should be able to answer it!

The model output after training for the “Who starred in the movie Oppenheimer?” query looks something like this:

Model output after training

The model output after training for the “What are the release dates for Oppenheimer movie?” query looks something like this:

Model output after training

Summary and next steps

In this tutorial, you learned how to create a knowledge base for the movie Oppenheimer. After you set up the InstructLab CLI, you downloaded the base model, generated synthetic data from the qna.yaml file, and trained the model on it. Then, you tested your fine-tuned model and chatted with it.

To get started, join the InstructLab community on GitHub, create other knowledge bases, and contribute them to the InstructLab taxonomy tree by raising a pull request. You can also explore IBM foundation models in the IBM watsonx.ai studio that are designed to support knowledge and skills contributed by the open source community.

Acknowledgements

I thank my teammates Syeda Ameena Begum, Suman P, Suyash Gupte, Surya Deep Singh, Ravi Kumar Srirangam, Sumeet Kapoor, and Amol Dhondse for encouraging me to write these InstructLab tutorials and for reviewing the content.