Tutorial

Guiding Llama 2 with prompt engineering by developing system and instruction prompts

Best practices for prompt engineering using Llama 2 on watsonx.ai

By

Nikhil Gopal,

Dheeraj Arremsetty

Prompt engineering is the practice of guiding large language model (LLM) outputs by providing the model context on the type of information to generate. Depending on the LLM, prompts can take the form of text, images, or even audio. Embedding refers to the process of encoding any kind of information into numerical format by representing key features of the input as a numerical vector. The LLM can then perform mathematical operations on these embeddings to generate a desired output. Once a user inputs a prompt, it is embedded and then sent to the model, acting as an instruction set for how the model should generate its output.
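
As a quick illustration of what "embedding" means in practice, here is a minimal sketch that encodes a prompt into a numerical vector. It assumes the open-source sentence-transformers library purely for demonstration; watsonx.ai models perform this step internally when you submit a prompt.

# Minimal sketch: encode a prompt into a numerical vector.
# Assumes sentence-transformers is installed (pip install sentence-transformers);
# the model name below is an illustrative choice, not part of watsonx.ai.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
prompt = "List 5 capitals of US States in a list format"
vector = encoder.encode(prompt)  # NumPy array of floats representing the prompt

print(vector.shape)  # (384,) for this particular encoder
print(vector[:5])    # first few components of the embedding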

Llama 2 and prompt engineering

Llama 2 is one of the most popular large language models (LLMs), released by Meta in July 2023. The model performs exceptionally well on a wide variety of performance metrics, even rivaling OpenAI’s GPT-4 in some cases. Llama 2 is one of the few completely open-source models with no restrictions on either academic or commercial use (for products and services with fewer than 700 million monthly users). Llama 2 is available in 7, 13, and 70 billion parameter variants, letting users trade off the compute resources required for inference against model performance. Meta has also fine-tuned Llama 2 for chat and code generation use cases by further training the model on chat- and code-specific datasets, offering improved performance in those domains.

Llama 2 has been trained for a variety of tasks and is built with a decoder-only architecture. Decoder models are designed to generate contextually relevant outputs based on a given input, so Llama 2 performs best on text generation, text completion, and dialogue-based tasks. The fine-tuned Llama-2-chat variant is particularly useful for chatbot use cases. When you input instruction-based prompts such as “List 5 capitals of US States in a list format,” you might find that Llama generates additional, unwanted text after returning the requested output. Finally, Meta states that Llama 2 is not suitable for outputs in languages other than English because the model’s training corpus comprises mainly English text.

If you want to perform natural language processing (NLP) tasks, instruction-based tasks, or any task that requires the model to synthesize information, you might prefer a model with an encoder-decoder architecture such as FLAN-UL2-20b. These models excel at “encoding” the input into representations from which they can better extract meaning, and then use the decoder part of their architecture to generate the final output. Common uses of these models include summarization, machine translation, and speech recognition. Finally, FLAN is a multilingual model and is suitable for generating outputs in multiple languages, including English, French, German, and Romanian.

One of the main features of Llama 2 is its doubled context window compared to its predecessor: 4,096 tokens, up from Llama 1’s 2,048. The context window refers to the amount of information (measured in tokens) that the model can consider when generating outputs. A token is a unit of text that a model can take as input, ranging from a single character to a whole word. Practically, this doubled context window results in higher-quality outputs because the model “remembers” a greater number of input prompt tokens before discarding old ones, hence the name “context window”.
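
To get a rough sense of how many tokens a prompt consumes, you can count them with a tokenizer before sending the prompt to the model. The sketch below uses the Hugging Face transformers tokenizer for Llama 2; the model name is an assumption for illustration (the official Llama 2 tokenizers are gated and require license approval), so substitute any compatible tokenizer.

# Rough sketch: check a prompt's length against Llama 2's context window.
# The tokenizer name is an assumption for illustration; access to the
# official Llama 2 tokenizer on Hugging Face requires license approval.
from transformers import AutoTokenizer

MAX_CONTEXT_TOKENS = 4096  # Llama 2's context window, double Llama 1's 2,048

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

def count_tokens(text: str) -> int:
    """Return the number of tokens the model would see for this text."""
    return len(tokenizer.encode(text))

prompt = "There's a llama in my garden. What should I do?"
used = count_tokens(prompt)
print(f"{used} tokens used, {MAX_CONTEXT_TOKENS - used} left in the context window")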

Another key feature of Llama 2 is “ghost attention,” a new spin on the attention mechanism introduced with the transformer model architecture. The attention layer of a foundation model or neural network helps the model understand which parts of the input are the most important when computing the output. The researchers who created Llama 2 propose the ghost attention mechanism: during fine-tuning, the chat variant was trained to generate outputs using separate system and instruction prompts.

This separation increases model output quality by increasing the weight given to the system prompt, as opposed to letting the model’s attention layer autonomously decide which parts of the input to give the most weight to. System prompts should contain overall context like “the model should generate responses as if it were speaking like a pirate,” while instruction prompts can contain instructions for how the model should generate outputs and labelled examples. Since extra weight is given to the system prompt, the model is better able to factor in the overall context when following instructions, increasing overall response quality. This is further detailed in the Llama 2 research paper.

Getting started with prompt engineering using Llama-2-Chat

The screen captures in this tutorial are from the watsonx.ai Prompt Lab, a GUI-based, no-code tool for quickly testing different models and prompts. Using the Prompt Lab, you can quickly see the difference in outputs between prompts formatted with correct system and instruction prompts and those without them.

To integrate llama-2-chat into your apps, you can instead leverage the Python SDK to call the watsonx.ai API and receive your model outputs as JSON responses. We recommend using Watson Studio Jupyter Notebooks, but you can also develop locally and make calls to the Watson Machine Learning API. A full notebook with code samples to follow along with is provided on GitHub. The notebook contains examples of properly formatted prompts and helper functions to properly add system and instruction prompts to your code.
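
For reference, here is a minimal sketch of such a call with the ibm-watson-machine-learning Python SDK. The model ID, endpoint URL, and parameter names reflect the SDK at the time of writing and should be checked against the current documentation; the API key and project ID are placeholders you must replace with your own values.

# Minimal sketch: call llama-2-13b-chat through the watsonx.ai API.
# Replace the placeholder API key and project ID with your own values.
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams

credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",  # use the endpoint for your region
    "apikey": "YOUR_IBM_CLOUD_API_KEY",
}

model = Model(
    model_id="meta-llama/llama-2-13b-chat",
    params={
        GenParams.DECODING_METHOD: "greedy",
        GenParams.MAX_NEW_TOKENS: 200,
    },
    credentials=credentials,
    project_id="YOUR_PROJECT_ID",
)

prompt = "List 5 capitals of US States in a list format"
print(model.generate_text(prompt))  # generated text only
# model.generate(prompt) returns the full JSON response, including token counts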

Screen capture of the Prompt Lab

Prerequisites

To follow this tutorial, you need:

  • An IBM Cloud account with access to watsonx.ai

Steps

Step 1. Create a watsonx.ai project

  1. Log in to watsonx.ai by using your IBM Cloud account.
  2. Create a watsonx.ai project by clicking the + sign in the upper right of the Projects box.

Screen capture of creating a watsonx.ai project

Step 2. Associate a Watson Machine Learning instance to your project

  1. Using the Navigation menu on the top left, navigate to the services catalog by selecting Administration > Services > Services catalog.
  2. Provision a free instance of Watson Machine Learning using the Lite plan.

    Screen capture of the services catalog

  3. Navigate to the Manage tab of your watsonx.ai project.

  4. On the left side menu, select Services and Integration.
  5. Click the Associate service button.
  6. Associate your Watson Machine Learning instance with your project.

    Screen capture of associating a Watson Machine Learning instance with the project

Step 3: Create and open a Jupyter Notebook or Prompt Lab session

  1. Navigate to the Assets tab of your project, and then click New asset.
  2. In the Work with models section, select Work with data and models in Python or R notebooks or Experiment with foundation models and build prompts.

    Screen capture of the options to open a notebook or a Prompt Lab session

  3. If you are opening a notebook, name your asset, and optionally, give it a description.

    If you are using Prompt Lab, you will name the prompt session when you save the prompt.

  4. If you chose to use a Jupyter notebook, make sure that a Python runtime is selected using the Select Runtime drop-down. The Spark and NLP runtimes are intended for big data and NLP-specific use cases and are not needed for this tutorial. If you are unsure which runtime to choose, select the latest version of the standard Python runtime.

    Screen capture of the Python runtime choices

  5. Select llama-2-13b-chat foundation model as the model.

    Screen capture of the foundation model library

Step 4: Define the prompts

The llama-2-chat model uses the following format to define system and instruction prompts:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_message }} [/INST]

Let’s break down the different parts of the prompt structure:

  • <s>: the beginning of the entire sequence.
  • <<SYS>>: the beginning of the system message.
  • <</SYS>>: the end of the system message.
  • [INST]: the beginning of some instructions.
  • [/INST]: the end of some instructions.
  • {{ system_prompt }}: Where the user should edit the system prompt to give overall context to model responses.
  • {{ user_message }}: Where the user should provide instructions to the model for generating outputs.

Here is an example of a full prompt that a user might send to a Llama-2-chat model acting as a virtual assistant that is designed to provide only helpful responses without any hateful or harmful content. The user asks the model how to respond if there is a llama in their garden:

<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

There's a llama in my garden 😱 What should I do?
[/INST]

If you are coding in Python, we have also prepared a sample code template and helper function in the Jupyter notebook to easily send prompts to Llama:

# Special strings that mark the instruction and system sections of a Llama 2 prompt
B_INST, E_INST = "<s>[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

# Wrap the system prompt in its <<SYS>> ... <</SYS>> delimiters
SYSTEM_PROMPT = B_SYS + DEFAULT_SYSTEM_PROMPT + E_SYS

def get_prompt(instruction):
    """Return a fully formatted Llama 2 prompt for the given instruction."""
    prompt_template = B_INST + SYSTEM_PROMPT + instruction + E_INST
    return prompt_template

You can call the get_prompt() function to get a perfectly formatted Llama prompt to send to the LLM. To edit the system prompt, simply edit the DEFAULT_SYSTEM_PROMPT string. Provide your instructions by passing in the instruction argument to the function.
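
For example, assuming the watsonx.ai Model object from the earlier SDK sketch (or however you call the model in your own code), you could send the llama-in-the-garden question like this:

# Build a correctly formatted Llama 2 prompt and send it to the model.
# `model` is assumed to be the watsonx.ai Model object sketched earlier;
# adapt this to however you invoke the model in your own application.
instruction = "There's a llama in my garden 😱 What should I do?"
formatted_prompt = get_prompt(instruction)

print(formatted_prompt)  # inspect the <s>[INST] <<SYS>> ... structure
print(model.generate_text(formatted_prompt))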

Summary and next steps

By using the Llama 2 ghost attention mechanism, watsonx.ai users can significantly improve their Llama 2 model outputs. Because the model recognizes system prompts and user instructions for prompt engineering, it provides more in-context answers when this prompt template is used.

By using Prompt Lab, one can easily experiment with different prompts in a UI-based, no-code tool for prompt engineering. With the watsonx.ai Python SDK, users can integrate LLMs into their apps and make calls to the API to retrieve model responses.

Next, you can explore the retrieval-augmented generation (RAG) technique to enhance retrieval accuracy and improve the quality of LLM-generated responses in the article “Retrieval augmented generation with large language models from watsonx.ai.”