Large Language Models (LLMs) can fuel a diverse array of practical applications: a virtual assistant that chats with you, helps you with coding, or even generates content for your next marketing email. That’s the power of LLMs.
Sure, it’s impressive to see these models effortlessly generate coherent text, provide insightful responses, and offer solutions to business challenges. But here’s the catch: while LLMs offer a plethora of possibilities, there are boundaries to what you can do with them. The real complexity lies beneath the surface, in the intricate web of computation required to adapt these models to specific business needs.
AI practitioners often find themselves needing to tweak pre-trained LLMs to fit specific business needs, but there are limits to how much you can modify these models. For example, to adjust the parameters of and retrain the open source Llama 2-70B model from Meta AI, you need approximately 6,000 GPUs running for about 12 days, which costs approximately $2 million (6,000 GPUs × 12 days × 24 hours is roughly 1.7 million GPU-hours; at cloud rates of a little over $1 per GPU-hour, the total lands near $2 million).
Fine-tuning pretrained LLMs on downstream data sets does yield significant performance improvements over using them out of the box (for example, with zero-shot inference). However, the process is time-consuming and computationally expensive, as the Llama 2-70B example shows.
As LLMs continue to grow in size, fully fine-tuning them on consumer hardware becomes increasingly infeasible. Storing and deploying a separate fine-tuned model for each downstream task is also expensive, because each copy is as large as the original pretrained model. And LLM refinement typically requires large amounts of human-generated data, which is time-consuming and, again, expensive.
This is where InstructLab steps in. It takes a different approach to overcome these limitations, giving AI practitioners a practical way to fine-tune and deploy LLMs for specific business needs.
How InstructLab helps enhance LLMs
InstructLab is an open-source AI project, developed by IBM and Red Hat, that is designed to enhance LLMs through community contributions. It streamlines the process of training LLMs by facilitating the submission of skills and knowledge from the community.
InstructLab uses a combination of these processes:
Taxonomy-driven data curation
Large-scale synthetic data generation
A multi-phased instruction-tuning method
Let’s examine the InstructLab approach carefully, breaking it down step by step.
Taxonomy-driven data curation
In simple terms, a taxonomy is like a tree structure that organizes things into categories and subcategories. In our context, the taxonomy classifies data samples into smaller groups based on different tasks.
It has three main branches:
Knowledge: This is like a big branch in our tree of information. It is divided into different types of documents, like textbooks or technical manuals, and then into specific topics, like finance or statistics. Each topic has its own set of documents and sample questions and answers. This helps us make sure we’re using the right documents for training, and that we choose only ones we have permission to use.
Foundational skills: These are the basic skills that our model needs to learn first, like math, coding, language skills, and reasoning. To teach these skills, we use datasets that are available to the public. This helps our model get ready to learn more complex skills later on.
Compositional skills: These are skills that combine knowledge and foundational skills to answer complex questions. For example, if our model needs to write an email about a company’s performance, it needs to understand financial terms, do math, and know how to write a formal email. These skills work together to help our model tackle tough tasks.
Each main branch is further divided into more specific levels where tasks are defined. The tasks are represented as leaves on the branches and are illustrated with manually written instruction-response pairs. This structuring makes it easy to identify missing tasks and add them to the training data. New tasks can be added to the taxonomy by creating a new category under the appropriate branch. See the following taxonomy tree graphic, as seen in the Lab: Large-Scale Alignment for Chatbots paper.
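To make the structure concrete, here is a minimal Python sketch of how such a taxonomy can be represented and extended. The class, field, and task names are illustrative only; they are not InstructLab’s actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    """A branch (category) or leaf (task) in the taxonomy tree."""
    name: str
    children: dict = field(default_factory=dict)       # sub-categories by name
    seed_examples: list = field(default_factory=list)  # instruction-response pairs (leaves only)

    def add_task(self, path, examples):
        """Create a new leaf under the right branch and attach its seed examples."""
        node = self
        for part in path:
            node = node.children.setdefault(part, TaxonomyNode(part))
        node.seed_examples.extend(examples)

    def leaves(self):
        """Yield every task leaf; useful for spotting branches with no tasks yet."""
        if not self.children:
            yield self
        for child in self.children.values():
            yield from child.leaves()

# Add a hypothetical compositional-skill task with one seed instruction-response pair.
taxonomy = TaxonomyNode("root")
taxonomy.add_task(
    ["compositional_skills", "writing", "performance_email"],
    [{"question": "Write a formal email summarizing Q3 revenue growth of 8%.",
      "answer": "Dear team, ..."}],
)
for leaf in taxonomy.leaves():
    print(leaf.name, len(leaf.seed_examples))
```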
Large-scale synthetic data generation
Large-scale synthetic data generation involves four stages:
Generate instructions. In this first stage, the teacher model serves as a question generator, drawing on its extensive knowledge through a specialized prompt to create diverse questions. By iterating through each leaf node of the taxonomy, the teacher model generates queries that adhere to specific principles. This stage explores the targeted domain and ensures the comprehensiveness of the generated content.
Evaluate the synthetic instructions. In this second stage, the teacher model acts as an instruction evaluator. Using targeted prompts, it filters out questions that do not meet predefined criteria, such as relevance to the domain, potential for harm, or feasibility within the LLM’s answering capabilities. This stage ensures that only high-quality, contextually appropriate questions progress further in the process.
Generate responses. In this third stage, the teacher model functions as a response generator, adopting dual personas for precision and creativity, guided by distinct prompts. This tailored approach allows for the generation of creative responses (for writing or role play) and precise responses (for STEM or data extraction). The response style is aligned to human expectations using principles and seed examples from the leaf nodes in the taxonomy tree.
Evaluate the synthetic instruction-response pairs. In this final stage, the teacher model meticulously evaluates each instruction-response pair, filtering out responses that are incorrect, irrelevant, or deviate from the provided principles. This ensures that both the quality and relevance of the training data set are enhanced for the student model.
This phased approach helps maintain the integrity and reliability of the synthetic data generated for training purposes.
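The following Python sketch shows the shape of this four-stage pipeline. The `teacher` parameter stands in for any call to the teacher model, and the prompts and filter criteria are simplified illustrations, not InstructLab’s actual prompts.

```python
def generate_instructions(teacher, leaf, n=5):
    """Stage 1: the teacher generates diverse questions for one taxonomy leaf."""
    prompt = f"Using these examples as seeds, write {n} new questions:\n{leaf['seed_examples']}"
    return teacher(prompt).splitlines()

def evaluate_instructions(teacher, questions, domain):
    """Stage 2: keep only questions the teacher judges on-topic, harmless, answerable."""
    keep = []
    for q in questions:
        verdict = teacher(f"Is this question relevant to {domain}, harmless, "
                          f"and answerable by an LLM? Reply yes or no.\n{q}")
        if verdict.strip().lower().startswith("yes"):
            keep.append(q)
    return keep

def generate_responses(teacher, questions, persona="precise"):
    """Stage 3: answer each question under a 'precise' or 'creative' persona."""
    style = ("Answer precisely and factually." if persona == "precise"
             else "Answer creatively.")
    return [(q, teacher(f"{style}\n{q}")) for q in questions]

def evaluate_pairs(teacher, pairs, principles):
    """Stage 4: discard pairs whose answers are wrong, irrelevant, or off-principle."""
    return [(q, a) for q, a in pairs
            if teacher(f"Given these principles: {principles}\nDoes this answer the "
                       f"question correctly?\nQ: {q}\nA: {a}\nReply yes or no.")
               .strip().lower().startswith("yes")]
```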
Multi-phased instruction-tuning framework
The final step in the InstructLab process is retraining the model using the synthetic data. InstructLab uses a multi-phase tuning framework to ensure diversity and quality in data generation, to provide training stability, and to prevent catastrophic forgetting.
In the data generation phase, InstructLab concentrates on creating an instruction-tuning data set that is both diverse and high in quality. It captures a broad spectrum of scenarios so that the generated data accurately represents real-world situations. This diversity is crucial for training the model effectively and enabling it to tackle various tasks with precision and dependability.
The multi-phased tuning framework helps prevent catastrophic forgetting. Catastrophic forgetting occurs when a model loses previously learned information upon learning new data. To help prevent catastrophic forgetting, InstructLab integrates replay buffers that act as a memory bank for the LLM. By revisiting and retraining on earlier data periodically, the LLM retains its knowledge while acquiring new skills, which ensures stable and consistent performance over time.
Why does catastrophic forgetting occur? Neural networks are typically optimized for specific tasks by adjusting their weights through back-propagation. When trained on a new task, the model’s weights are readjusted to fit the new task’s data. Improper management of these adjustments can overwrite information from previous tasks, leading to catastrophic forgetting.
Consider training a model to recognize fruits from images. Initially, it learns to identify apples and performs well. Next, you train the same model to recognize oranges. During this new training, adjustments to improve orange recognition may degrade the model’s ability to recognize apples, despite its prior proficiency.
How does a replay buffer help prevent catastrophic forgetting? A replay buffer acts as a storage system, retaining a subset of data samples from earlier training phases. As the model progresses to new training phases, it periodically revisits and trains on this stored data alongside new data. This helps the model maintain knowledge from earlier phases while acquiring new information.
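Here is a minimal Python sketch of the replay buffer idea, assuming a simple capped store of earlier samples. The capacity and replay fraction are illustrative defaults, not InstructLab’s actual settings.

```python
import random

class ReplayBuffer:
    """Keeps a capped sample of earlier training data to mix into later phases."""
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.samples = []

    def add(self, batch):
        """Store a batch; downsample randomly if the buffer exceeds capacity."""
        self.samples.extend(batch)
        if len(self.samples) > self.capacity:
            self.samples = random.sample(self.samples, self.capacity)

    def mix(self, new_batch, replay_fraction=0.25):
        """Return new data plus a slice of old data, so earlier skills get revisited."""
        k = min(int(len(new_batch) * replay_fraction), len(self.samples))
        return new_batch + random.sample(self.samples, k)
```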
InstructLab’s multi-phase tuning framework has two phases:
Knowledge tuning
Skill tuning
InstructLab then uses established benchmarks to select the optimal model checkpoint.
Knowledge tuning
The knowledge tuning phase is the initial step in the InstructLab training process, and it is essential for enhancing the LLM’s foundational understanding. Knowledge tuning consists of two steps:
Step 1: Short response training
Data: Samples with short responses from the knowledge and foundational skills branches.
Replay buffer: Not applicable for this first step.
Step 2: Long response training
Data: Samples with longer responses, ensuring that the LLM can handle complex information.
Replay buffer: Data from step 1 is included to prevent forgetting of earlier learning.
Skill tuning
After completing knowledge tuning, InstructLab moves on to skill tuning:
Data: Samples from the compositional skills branch, integrating multiple foundational skills.
Replay buffer: Data from both knowledge tuning steps is included to mitigate catastrophic forgetting.
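Putting the phases together, here is a sketch of how the sequence might look in code, reusing the ReplayBuffer from the earlier sketch. The `train` function is a placeholder for a supervised fine-tuning pass, not a real API.

```python
def train(model, samples):
    """Placeholder for one supervised fine-tuning pass over `samples`."""
    ...

def multi_phase_tune(model, short_knowledge, long_knowledge, skills, buffer):
    """Sequence the phases described above; `buffer` is a ReplayBuffer instance."""
    # Knowledge tuning, step 1: short responses, no replay yet.
    train(model, short_knowledge)
    buffer.add(short_knowledge)

    # Knowledge tuning, step 2: long responses, replaying step 1 data.
    train(model, buffer.mix(long_knowledge))
    buffer.add(long_knowledge)

    # Skill tuning: compositional skills, replaying both knowledge steps.
    train(model, buffer.mix(skills))

    # In practice, checkpoints from each phase would be compared on
    # benchmarks and the best one kept.
    return model
```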
Approaches for model adaptation: Prompt tuning, fine tuning, and InstructLab
There are three approaches for model adaptation:
Prompt tuning – Adjusts large foundational models for new tasks by learning soft prompts that are combined with the input data; the model’s weights stay frozen (see the sketch after this list).
Fine tuning – Optimizes the performance of pre-trained models for specific tasks by updating model parameters.
InstructLab – Enhances large language models by combining taxonomy-driven data curation, large-scale synthetic data generation, and a multi-phase tuning framework. Prevents catastrophic forgetting using a replay buffer mechanism.
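To make the first approach concrete, here is a minimal PyTorch sketch of the soft-prompt idea behind prompt tuning: a handful of learnable prompt vectors are prepended to the input embeddings while the base model’s weights stay frozen. The names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to the input embeddings."""
    def __init__(self, n_tokens: int, embed_dim: int):
        super().__init__()
        # Small random init; only these parameters receive gradients.
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Usage sketch: freeze the base model, train only the soft prompt.
# for p in base_model.parameters():
#     p.requires_grad = False
# soft_prompt = SoftPrompt(n_tokens=20, embed_dim=4096)
# optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=3e-4)
```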
Let’s explore each method to determine the ideal choice for different scenarios. See the following table.
Use case
Prompt tuning: Tailoring model output to match the desired style and tone for specific tasks, such as generating product descriptions.
Fine tuning: Adapting pre-trained models to highly specialized domains, such as medical diagnosis based on patient symptoms.
InstructLab: Augmenting LLMs with specialized skills relevant to industries like banking, insurance, telecom, or retail for call center operations.

Training process
Prompt tuning: Craft specific text prompts to guide the model’s responses without retraining or updating model weights.
Fine tuning: Train pre-trained models further on domain-specific data sets to adjust model parameters.
InstructLab: Use a multi-phase tuning framework, including knowledge tuning and skill tuning, to enhance model performance and mitigate catastrophic forgetting. Replay buffers retain earlier learning, and optimal model checkpoints are selected based on benchmark assessments.

Benefits
Prompt tuning: Efficient for adapting models to new tasks without retraining, while maintaining control over model behavior.
Fine tuning: Optimizes model performance for specific domains, improving accuracy and suitability for target tasks.
InstructLab: Simplifies the model training process, reduces resource requirements, and allows ongoing enhancement through community contributions. Uses large-scale synthetic data generation. Mitigates catastrophic forgetting by preserving earlier learning, making it suitable for deployment behind a firewall where data access can be restricted.

Disadvantages
Prompt tuning: Limited flexibility beyond the scope of the provided prompts, which can constrain adaptability.
Fine tuning: Requires large amounts of domain-specific data, which may not always be available.
InstructLab: Relies on the quality and diversity of community-contributed data, which may vary and require additional curation and validation.
Summary and next steps
InstructLab, a groundbreaking AI project developed by IBM and Red Hat, offers developers a powerful toolset to enhance Large Language Models (LLMs) for specific business needs. By addressing the challenges of fine-tuning and deploying LLMs, InstructLab streamlines the training process through taxonomy-driven data curation, large-scale synthetic data generation, and a multi-phased instruction-tuning framework.
For developers, InstructLab offers several compelling benefits:
Simplified training process: InstructLab simplifies the complex task of fine-tuning LLMs, reducing time and resource requirements.
Cost efficiency: By using synthetic data generation, InstructLab lowers the cost barriers associated with real-world data collection, making model enhancement more accessible.
Community collaboration: Developers can tap into a collaborative ecosystem, benefiting from a collective pool of knowledge and expertise for data curation and model enhancement.
Mitigation of catastrophic forgetting: InstructLab’s framework ensures stable performance over time, which is crucial for developers seeking reliable model behavior.
InstructLab empowers developers to unleash the full potential of LLMs, offering a streamlined training process, cost-efficiency, community collaboration, and stability in model performance. Embracing InstructLab is not just about building better models; it's about shaping the future of AI development.
To get started, join the InstructLab community on GitHub. You can also explore IBM foundation models from IBM watsonx.ai that are designed to support knowledge and skills contributed by the open source community.