Tutorial

Build a local AI co-pilot using IBM Granite Code, Ollama, and Continue

Discover how to adopt AI co-pilot tools in an enterprise setting with open source software

By

Gabe Goodhart

In this tutorial, I will show how to use a collection of open source components to run a feature-rich developer co-pilot in Visual Studio Code while meeting the data privacy, licensing, and cost challenges common to enterprise users. The setup is powered by local large language models (LLMs) from IBM's open source LLM family, Granite. All components run on a developer's workstation and have business-friendly licensing. For the quick version, jump straight to the TL;DR end-to-end setup script.

The developer world is quickly becoming the best place for AI developers to drink our own champagne, with the promise of generative AI accelerating our own work. There are numerous excellent AI co-pilot tools on the market (GitHub Copilot, Tabnine, Sourcegraph Cody, and watsonx Code Assistant, to name just a few). These tools offer in-editor chatbots, code completion, code explanation, test generation, auto-documentation, and a host of other developer-centric features. Unfortunately, for many of us, these tools sit out of reach behind corporate data privacy policies (yes, we can access watsonx Code Assistant here at IBM, but the rest are not available).

There are three main barriers to adopting these tools in an enterprise setting:

  1. Data Privacy: Many corporations have privacy regulations that prohibit sending internal code or data to third-party services.
  2. Generated Material Licensing: Many models, even those with permissive usage licenses, do not disclose their training data and may therefore produce output derived from training material with licensing restrictions.
  3. Cost: Many of these tools are paid solutions that require investment by the organization. For larger organizations, this often includes paid support and maintenance contracts, which can be extremely costly and slow to negotiate.

In this tutorial, I will show how I solved all of these problems using IBM's Granite Models, Ollama, Visual Studio Code, and Continue.

Figure: Architecture of a local co-pilot

Step 1. Install Ollama

The first problem to solve is avoiding the need to send code to a remote service. One of the most widely used tools in the AI world right now is Ollama, which wraps the underlying model-serving project llama.cpp. The ollama CLI makes it seamless to run LLMs on a developer's workstation, exposing an OpenAI-compatible API with the /completions and /chat/completions endpoints. It takes advantage of available GPU resources and offloads to CPU where needed. My workstation is a MacBook Pro with an Apple M3 Max and 64GB of shared memory, which means I have roughly 45GB of usable VRAM to run models with! Users with less powerful hardware can still use ollama with smaller models and/or models with higher levels of quantization.

On a Mac workstation, the simplest way to install ollama is via their webpage: https://ollama.com/download. This will install a menu-bar app to run the ollama server in the background and keep you up-to-date with the latest releases.

You can also install ollama with homebrew:

brew install ollama

If installing from brew, or building from source, you need to start the ollama server yourself:

ollama serve
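
With the server running, a quick curl against the default port (11434) confirms it's reachable. The second call below uses the OpenAI-compatible chat endpoint and assumes you've already pulled the granite3.2:8b model from Step 2:

# List the models the local ollama server knows about
curl http://localhost:11434/api/tags

# Query the OpenAI-compatible chat endpoint (requires a pulled model; see Step 2)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "granite3.2:8b", "messages": [{"role": "user", "content": "Say hello"}]}'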

Step 2. Fetch the Granite models

The second problem to solve is choosing a model that gives high-quality output and was trained on enterprise-safe data. There are numerous good code models available in the ollama library and on Hugging Face. According to the IBM Research paper "Granite 3.0 Language Models," the Granite training data was meticulously curated to ensure that all training code carried enterprise-friendly licenses and that the text was free of hate, abuse, and profanity. Since generated material licensing is one of the primary concerns I've already identified, and since I work for IBM, I chose this family of models for my own use.

Granite comes in a range of sizes and architectures to fit your workstation's available resources. Generally, the bigger dense models perform best, but require more resources and will be slower. I chose the 8b dense option as my starting point for chat and the Granite Code 3b option for autocomplete. Ollama offers a convenient pull feature to download models:

ollama pull granite3.2:8b
ollama pull granite-code:3b
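
Once the downloads finish, you can confirm the models are available locally and give the chat model a quick one-shot prompt from the terminal:

# List the models ollama has available locally
ollama list

# Send a single prompt to the chat model and print the response
ollama run granite3.2:8b "Write a function that reverses a string in Python"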

In addition to the language models for chat and code generation, you will need a strong embedding model to enable the Retrieval Augmented Generation (RAG) capabilities of Continue. The Granite family also contains strong, lightweight embedding models. I chose granite-embedding:30m since my code is entirely in English and the 30m model performs well at a fraction of the weights of other leading models. You can pull it too:

ollama pull granite-embedding:30m
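
If you'd like to sanity-check the embedding model, you can call Ollama's embeddings endpoint directly. Note that the exact endpoint differs between Ollama versions: newer releases use /api/embed with an "input" field, while older ones use /api/embeddings with a "prompt" field.

# Request an embedding vector for a small code snippet (newer /api/embed form)
curl http://localhost:11434/api/embed \
  -d '{"model": "granite-embedding:30m", "input": "def hello(): pass"}'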

Step 3. Set up Continue

With the Granite models available and ollama running, it's time to start using them in your editor. The first step is to get Continue installed into Visual Studio Code. This can be done with a quick command line call:

code --install-extension continue.continue

Alternatively, you can install Continue using the Extensions tab in VS Code:

  1. Open the Extensions tab.
  2. Search for "Continue".
  3. Click the Install button.
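
Either way, you can verify that the extension is installed from the command line:

# List installed VS Code extensions and filter for Continue
code --list-extensions | grep -i continue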

Next, you need to configure Continue to use your Granite models with Ollama.

Continue Command Palette

  1. Open the command palette (Press Ctrl/Cmd+Shift+P)
  2. Select Continue: Open config.json.

This will open the central config file ($HOME/.continue/config.json by default) in your editor. To enable your ollama Granite models, you'll need to edit three sections (a fully assembled example follows the list):

  • models: This will set up the model to use for chat and long-form prompts (e.g. explain)
  "models": [
    {
      "title": "Granite 3.2 8b",
      "provider": "ollama",
      "model": "granite3.2:8b"
    }
  ],
  • tabAutocompleteModel: This will set up the model to use for inline completions
  "tabAutocompleteModel": {
    "title": "Granite Code 3b",
    "provider": "ollama",
    "model": "granite-code:3b"
  },
  • embeddingsProvider: This will set up the embedding model to use for indexing your code
  "embeddingsProvider": {
      "provider": "ollama",
      "model": "granite-embedding:30m",
      "maxChunkSize": 512
  },
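
Putting the three sections together, a minimal config.json (leaving everything else at its defaults) looks roughly like this:

{
  "models": [
    {
      "title": "Granite 3.2 8b",
      "provider": "ollama",
      "model": "granite3.2:8b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Granite Code 3b",
    "provider": "ollama",
    "model": "granite-code:3b"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "granite-embedding:30m",
    "maxChunkSize": 512
  }
}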

Step 4. Give your co-pilot a try!

With Continue installed and the Granite models running in ollama, you should be ready to try out your new local AI co-pilot. Click the new Continue icon in your sidebar:

Continue sidebar

You can follow the usage guidelines in the documentation.

Next steps: Extend the framework

Once you're off the ground with the basic setup, there are lots of great ways to extend the framework to fit your personal needs.

Setting up custom commands

One of the great features of Continue is the ability to develop your own prompt-engineered commands. This can all be done in the "customCommands" section of the core config.json.

As an example, I created the /list-comprehension command to help with refactoring Python code to use list/dict comprehensions wherever possible:

  "customCommands": [
    ...
    {
      "name": "list-comprehension",
      "prompt": "{{{ input }}}\n\nRefactor the selected python code to use list comprehensions wherever possible. Present the output as a python code snippet.",
      "description": "Refactor to use list comprehensions"
    }
  ]

You can then call your custom command from the chat window by selecting code and adding it to the context with Ctrl/Cmd-L, followed by invoking your command (/list-comprehension).

Custom Command

Experimenting with different models

Another nice feature of Continue is the ability to easily toggle between different models in the chat panel. You can configure this using the "models" section of the core config.json. For me, this was useful for experimenting with the differences between the various sizes in the Granite family.

Model toggle

To set this up, simply add additional entries to the "models" list:

  "models": [
    {
      "title": "Granite 3.2 8b",
      "provider": "ollama",
      "model": "granite3.2:8b"
    },
    {
      "title": "Granite 3.2 8b 128k",
      "provider": "ollama",
      "model": "granite3.2:8b",
      "contextLength": 131072
    },
    {
      "title": "Granite 3.2 8b Thinking",
      "provider": "ollama",
      "model": "granite3.2:8b",
      "contextLength": 131072,
      "systemMessage": "Knowledge Cutoff Date: April 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant.\nRespond to every user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after 'Here is my thought process:' and write your response after 'Here is my response:' for each user query. You are a helpful AI assistant.\nRespond to every user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after 'Here is my thought process:' and write your response after 'Here is my response:' for each user query."
    },
    {
      "title": "Granite 3.2 2b",
      "provider": "ollama",
      "model": "granite3.2:2b"
    },
    {
      "title": "Granite 3.1 3b-a800m",
      "provider": "ollama",
      "model": "granite3.1-moe:3b"
    },
    {
      "title": "Granite 3.1 1b-a400m",
      "provider": "ollama",
      "model": "granite3.1-moe:1b"
    }
  ],

There are also other models in the ollama library that may be worth experimenting with, though many of them do not carry standard OSS licenses.

Import local models from GGUF and GGML

While the ollama library is a great tool to manage your models, many of us also have numerous model files already downloaded on our machines that we don't want to duplicate. The ollama Modelfile is a powerful tool that can be used to create customized model setups by deriving from known models and customizing the inference parameters, including the ability to add (Q)LoRA Adapters (see the docs for more details).
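
Purely as an illustration of what a Modelfile can do (the derived model name, parameter values, and system prompt here are arbitrary examples, not part of this setup), you can derive a tweaked variant of a pulled model like this:

# Sketch: derive a customized variant of a library model via a Modelfile
cat > Modelfile <<'EOF'
FROM granite3.2:8b
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
SYSTEM """You are a concise coding assistant."""
EOF
ollama create granite3.2-concise -f Modelfile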

For our purpose, we only need the simple FROM statement, which can point to a known model in the ollama library or a local file on disk. This makes it really easy to wrap the process in an import-to-ollama bash script:

#!/usr/bin/env bash

file_path=""
model_name=""
model_label="local"
while [[ $# -gt 0 ]]
do
    key="$1"

    case $key in
        -f|--file)
            file_path="$2"
            shift
            ;;
        -m|--model-name)
            model_name="$2"
            shift
            ;;
        -l|--model-label)
            model_label="$2"
            shift
            ;;
        *)
            echo "Unknown option: $key"
            exit 1
            ;;
    esac
    shift
done

if [ "$file_path" == "" ]
then
    echo "Missing required argument -f|--file"
    exit 1
fi
file_path="$(realpath $file_path)"

# Check if model_name is empty and assign file name as model_name if true
if [ "$model_name" == "" ]
then
    model_name=$(basename "$file_path")
    model_name="${model_name%.*}"
fi

# Append the model label to the model name
model_name="$model_name:$model_label"
echo "model_name: $model_name"

# Create a temporary directory for working
tempdir=$(mktemp -d)
echo "Working Dir: $tempdir"

# Write the file path to Modelfile in the temporary directory
echo "FROM $file_path" > $tempdir/Modelfile

# Import the model using ollama create command
echo "importing model $model_name"
ollama create "$model_name" -f "$tempdir/Modelfile"
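
Assuming you've saved the script as import-to-ollama.sh and made it executable, importing a local GGUF file looks something like this (the file path and model name are placeholders):

chmod +x import-to-ollama.sh
./import-to-ollama.sh -f ~/models/my-model.gguf -m my-model

# The imported model is tagged with the "local" label by default
ollama run my-model:local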

Local LLM Web UI

There are numerous additional AI applications, use cases, and patterns that can be adapted to work with local LLMs. Exploring LLMs locally can be greatly accelerated with a local web UI. The Open WebUI project (which started life as Ollama WebUI) works seamlessly with ollama to provide a web-based LLM workspace for experimenting with prompt engineering, retrieval augmented generation (RAG), and tool use.

To set up Open WebUI, follow the steps in their documentation. The simplest versions are:

Docker

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

Pip

pip install open-webui
open-webui serve

Once running, you can open the UI in your browser. Note that the Docker command above maps the UI to port 3000 on the host, while the pip install serves it on port 8080 by default.

open http://localhost:3000  # Docker install
open http://localhost:8080  # pip install

The first time you log in, you'll need to set up an "account." Since this is entirely local, you can fill in garbage values (foo@bar.com/asdf) and be off to the races!

Open WebUI

TL;DR

For the impatient, here's the end-to-end setup script:

# Install ollama
brew install ollama

# Start the ollama server in the background
ollama serve &

# Download the IBM Granite models
ollama pull granite3.2:8b
ollama pull granite-code:3b
ollama pull granite-embedding:30m

# Install the Continue extension in VS Code
code --install-extension continue.continue

# Configure Continue to use the models
printf %s\\n "{\"models\":[{\"title\":\"Granite 3.2 8b\",\"provider\":\"ollama\",\"model\":\"granite3.2:8b\"}],\"customCommands\":[{\"name\":\"test\",\"prompt\":\"{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.\",\"description\":\"Write unit tests for highlighted code\"}],\"tabAutocompleteModel\":{\"title\":\"Granite Code 3b\",\"provider\":\"ollama\",\"model\":\"granite-code:3b\"},\"allowAnonymousTelemetry\":false,\"embeddingsProvider\":{\"provider\":\"ollama\",\"model\":\"granite-embedding:30m\",\"maxChunkSize\":512}}" > $HOME/.continue/config.json

Summary

In this tutorial, I've demonstrated how to address the data privacy, licensing, and cost barriers to adopting AI co-pilot tools in an enterprise setting using IBM's Granite models, Ollama, Visual Studio Code, and Continue. Running the models locally lets developers harness AI-driven code completion, refactoring, and analysis while keeping their code entirely on their own machines.

Next steps

For a practical tour of building an application with your newly set up code assistant based on Granite, check out the tutorial, "Developing a gen AI application using IBM Granite."

Explore more articles and tutorials about watsonx on IBM Developer.
