In this tutorial, I will show how to use a collection of open source components to run a feature-rich developer copilot in Visual Studio Code while meeting the data privacy, licensing, and cost challenges common to enterprise users. The setup is powered by local large language models (LLMs) from IBM's open-source LLM family, Granite. All components run on a developer's workstation and have business-friendly licensing. For the quick version, jump straight to the TL;DR end-to-end setup script.
The developer world is quickly becoming one of the best places for AI practitioners to drink our own champagne, with generative AI promising to accelerate our own work. There are numerous excellent AI copilot tools on the market (GitHub Copilot, Tabnine, Sourcegraph Cody, and watsonx Code Assistant, to name just a few). These tools offer in-editor chatbots, code completion, code explanation, test generation, auto-documentation, and a host of other developer-centric features. Unfortunately, for many of us, these tools sit out of reach behind corporate data privacy policies (yes, we can access watsonx Code Assistant here at IBM, but the rest are not available).
There are three main barriers to adopting these tools in an enterprise setting:
Data Privacy: Many corporations have privacy regulations that prohibit sending internal code or data to third-party services.
Generated Material Licensing: Many models, even those with permissive usage licenses, do not disclose their training data and therefore may produce output that is derived from training material with licensing restrictions.
Cost: Many of these tools are paid solutions that require investment by the organization. For larger organizations, this often includes paid support and maintenance contracts, which can be extremely costly and slow to negotiate.
Step 1. Install ollama
The first problem to solve is avoiding the need to send code to a remote service. One of the most widely used tools in the AI world right now is Ollama, which wraps the underlying model-serving project llama.cpp. The ollama CLI makes it seamless to run LLMs on a developer's workstation, exposing an OpenAI-compatible API with /completions and /chat/completions endpoints. Users can take advantage of available GPU resources and offload to CPU where needed. My workstation is a MacBook Pro with an Apple M3 Max and 64GB of shared memory, which means I have roughly 45GB of usable VRAM to run models with! Users with less powerful hardware can still use ollama with smaller models and/or models with higher levels of quantization.
On a Mac workstation, the simplest way to install ollama is via their webpage: https://ollama.com/download. This installs a menu-bar app that runs the ollama server in the background and keeps you up to date with the latest releases.
You can also install ollama with homebrew:
brew install ollama
If you install from brew, or build from source, you need to start the ollama server yourself:
ollama serve
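Before moving on, you can confirm the server is reachable by listing the locally available models (the list will be empty on a fresh install). This is just a quick sanity check, not a required step:

ollama list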
Step 2. Fetch the Granite models
The second problem to solve is choosing a model that produces high-quality output and was trained on enterprise-safe data. There are numerous good code models available in the ollama library and on Hugging Face. According to the IBM Research paper "Granite 3.0 Language Models," the training data for the IBM Granite models was meticulously curated to ensure all training code carried enterprise-friendly licenses and all text was free of hate, abuse, and profanity. Since generated material licensing is one of the primary concerns I've already identified, and since I work for IBM, I chose this family of models for my own use.
Granite comes in a range of sizes and architectures to fit your workstation's available resources. Generally, the bigger dense models perform best, but require more resources and will be slower. I chose the 8b dense option as my starting point for chat and the Granite Code 3b option for autocomplete. Ollama offers a convenient pull feature to download models:
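ollama pull granite3.2:8b
ollama pull granite-code:3b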
In addition to the language models for chat and code generation, you will need a strong embedding model to enable the Retrieval Augmented Generation (RAG) capabilities of Continue. The Granite family also contains strong, lightweight embedding models. I chose granite-embedding:30m since my code is entirely in English and the 30m model performs well at a fraction of the weights of other leading models. You can pull it too!
ollama pull granite-embedding:30m
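If you want to see the OpenAI-compatible API mentioned earlier in action, you can hit the chat completions endpoint directly from the terminal. This is a minimal sanity check, assuming ollama's default port of 11434; Continue will make these calls for you once configured:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "granite3.2:8b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'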
Step 3. Set up Continue
With the Granite models available and ollama running, it's time to start using them in your editor. The first step is to get Continue installed into Visual Studio Code. This can be done with a quick command line call:
code --install-extension continue.continue
Alternatively, you can install Continue using the Extensions tab in VS Code:
Open the Extensions tab.
Search for "continue."
Click the Install button.
Next, you need to configure Continue to use your Granite models with Ollama.
Open the command palette (Press Ctrl/Cmd+Shift+P)
Select Continue: Open config.json.
This will open the central config file ($HOME/.continue/config.json by default) in your editor. To enable your ollama Granite models, you'll need to edit two sections:
models: This will set up the models to use for chat and long-form prompts (e.g., explain).
tabAutocompleteModel: This will set up the model to use for inline tab autocompletion, as shown in the example below.
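Here is a minimal sketch of what those sections can look like with the models pulled earlier; the embeddingsProvider entry for RAG is included as well, and the same values appear in the TL;DR script at the end:

{
  "models": [
    {
      "title": "Granite 3.2 8b",
      "provider": "ollama",
      "model": "granite3.2:8b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Granite Code 3b",
    "provider": "ollama",
    "model": "granite-code:3b"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "granite-embedding:30m",
    "maxChunkSize": 512
  }
}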
Once you're off the ground with the basic setup, there are lots of great ways to extend the framework to fit your personal needs.
Setting up custom commands
One of the great features of Continue is the ability to develop your own prompt-engineered commands. This can all be done in the "customCommands" section of the core config.json.
As an example, I created the /list-comprehension command to help with refactoring Python code to use list/dict comprehensions wherever possible:
"customCommands":[
...
{"name":"list-comprehension","prompt":"{{{ input }}}\n\nRefactor the selected python code to use list comprehensions wherever possible. Present the output as a python code snippet.","description":"Refactor to use list comprehensions"}]
You can then call your custom command from the chat window by selecting code, adding it to the context with Ctrl/Cmd-L, and then invoking your command (/list-comprehension).
Experimenting with different models
Another nice feature of Continue is the ability to easily toggle between different models in the chat panel. You can configure this in the "models" section of the core config.json. For me, this was useful for experimenting with the differences between the various sizes in the Granite family.
To set this up, you simply have to add additional entries in the "models" list:
"models":[{"title":"Granite 3.2 8b","provider":"ollama","model":"granite3.2:8b"},{"title":"Granite 3.2 8b 128k","provider":"ollama","model":"granite3.2:8b","contextLength":131072},{"title":"Granite 3.2 8b Thinking","provider":"ollama","model":"granite3.2:8b","contextLength":131072,"systemMessage":"Knowledge Cutoff Date: April 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant.\nRespond to every user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after 'Here is my thought process:' and write your response after 'Here is my response:' for each user query. You are a helpful AI assistant.\nRespond to every user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after 'Here is my thought process:' and write your response after 'Here is my response:' for each user query."},{"title":"Granite 3.2 2b","provider":"ollama","model":"granite3.2:2b"},{"title":"Granite 3.1 3b-a800m","provider":"ollama","model":"granite3.1-moe:3b"},{"title":"Granite 3.1 1b-a400m","provider":"ollama","model":"granite3.1-moe:1b"}],
There are also other models in the ollama library that may be worth trying out. Many of them do not carry standard OSS licenses, but they can still be interesting to experiment with.
While the ollama library is a great tool to manage your models, many of us also have numerous model files already downloaded on our machines that we don't want to duplicate. The ollama Modelfile is a powerful tool for creating customized model setups by deriving from known models and customizing the inference parameters, including the ability to add (Q)LoRA adapters (see the docs for more details).
For our purpose, we only need the simple FROM statement, which can point to a known model in the ollama library or a local file on disk. This makes it really easy to wrap the process into an import-to-ollama bash script:
#!/usr/bin/env bash

file_path=""
model_name=""
model_label="local"

while [[ $# -gt 0 ]]
do
    key="$1"
    case $key in
        -f|--file) file_path="$2"; shift ;;
        -m|--model-name) model_name="$2"; shift ;;
        -l|--model-label) model_label="$2"; shift ;;
        *) echo "Unknown option: $key"; exit 1 ;;
    esac
    shift
done

if [ "$file_path" == "" ]
then
    echo "Missing required argument -f|--file"
    exit 1
fi
file_path="$(realpath $file_path)"

# Check if model_name is empty and assign file name as model_name if true
if [ "$model_name" == "" ]
then
    model_name=$(basename $file_path)
    model_name="${model_name%.*}"
fi

# Append the model label to the model name
model_name="$model_name:$model_label"
echo "model_name: $model_name"

# Create a temporary directory for working
tempdir=$(mktemp -d)
echo "Working Dir: $tempdir"

# Write the file path to Modelfile in the temporary directory
echo "FROM $file_path" > $tempdir/Modelfile

# Import the model using ollama create command
echo "importing model $model_name"
ollama create $model_name -f $tempdir/Modelfile
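As a usage sketch, assuming you saved the script above as import-to-ollama.sh and have a local GGUF file on disk (the file name here is purely illustrative):

chmod +x import-to-ollama.sh
./import-to-ollama.sh -f ~/models/my-fine-tune.gguf -m my-fine-tune
# The imported model is now available under the :local label
ollama run my-fine-tune:local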
Local LLM Web UI
There are numerous additional AI applications, use cases, and patterns that can be adapted to work with local LLMs. Exploring LLMs locally can be greatly accelerated with a local web UI. The Open WebUI project (spawned out of ollama originally) works seamlessly with ollama to provide a web-based LLM workspace for experimenting with prompt engineering, retrieval augmented generation (RAG), and tool use.
To set up Open WebUI, follow the steps in their documentation; the simplest setups are just a couple of commands, as sketched below.
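As a rough sketch (assuming a recent Python environment; check the Open WebUI documentation for the currently recommended install method), a pip-based setup looks like this:

# Install Open WebUI and start it (it serves on port 8080 by default)
pip install open-webui
open-webui serve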
Once running, you can open the UI at http://localhost:8080.
open http://localhost:8080
The first time you log in, you'll need to set up an "account." Since this is entirely local, you can fill in garbage values (foo@bar.com/asdf) and be off to the races!
TL;DR
For the impatient, here's the end-to-end setup script:
# Install ollama
brew install ollama
# Start the ollama server in the background
ollama serve &
# Download the IBM Granite models
ollama pull granite3.2:8b
ollama pull granite-code:3b
ollama pull granite-embedding:30m
# Install continue in VS Code
code --install-extension continue.continue
# Configure continue to use the models
printf %s\\n "{\"models\":[{\"title\":\"Granite 3.2 8b\",\"provider\":\"ollama\",\"model\":\"granite3.2:8b\"}],\"customCommands\":[{\"name\":\"test\",\"prompt\":\"{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.\",\"description\":\"Write unit tests for highlighted code\"}],\"tabAutocompleteModel\":{\"title\":\"Granite Code 3b\",\"provider\":\"ollama\",\"model\":\"granite-code:3b\"},\"allowAnonymousTelemetry\":false,\"embeddingsProvider\":{\"provider\":\"ollama\",\"model\":\"granite-embedding:30m\",\"maxChunkSize\":512}}" > $HOME/.continue/config.json
Summary
I've demonstrated how to address the cost, licensing, and data privacy obstacles to adopting AI copilot tools in an enterprise setting using IBM's Granite models, Ollama, Visual Studio Code, and Continue. With this setup, developers can harness AI-driven code completion, chat, refactoring, and analysis while keeping their code on their own machine and ensuring the integrity and security of their codebase.
The Granite models are all available in watsonx.ai.
Build an AI strategy for your business on one collaborative AI and data platform called IBM watsonx, which brings together new generative AI capabilities, powered by foundation models, and traditional machine learning into a powerful platform spanning the AI lifecycle. With watsonx.ai, you can train, validate, tune and deploy models with ease and build AI applications in a fraction of the time with a fraction of the data. These models are accessible to all as many no-code and low-code options are available for beginners.
Try watsonx.ai, the next-generation studio for AI builders.