CodeLlama vs. CodeGemma: Using Open Models for AI Coding Assistance


The AI coding-tools market is a billion-dollar industry; it is expected to reach $17.2 billion by 2030, and even today, AI plugins for VS Code or JetBrains IDEs have millions of downloads. But can we run a local model as a free coding assistant, and how well will it perform? In this article, I will test two open models, CodeGemma and CodeLlama. I will install them on my PC, and we will see how they work.

Without further ado, let's get into it!

1. Models

At the time of writing this article, two major open models are available for free download and can be used for coding purposes:

  • CodeLlama. The model was released in 2023 by Meta; it is available in 7B, 13B, 34B, and 70B sizes, with "Base," "Instruct," and "Python" variants. Of the four sizes, only the 7B and 13B models can realistically be used locally; the others are just too "heavy."
  • CodeGemma. The model was released in 2024 by Google and is available in 2B and 7B sizes. The 2B model was trained only for code completion, while the 7B model was trained for both code infilling and natural language prompts.

In this article, I will test the 7B and 13B models, which are available on HuggingFace and can be downloaded in GGUF format. I will run an OpenAI-compatible local server that will allow us to use these models with different apps. But before doing that, let's just run the models in Python to see what they can do. Readers who want to jump straight into practical use can skip this part.

To test both models, I will use a free Google Colab instance. First, let's load the model and the tokenizer:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import transformers
import torch

# HuggingFace model ID of the model to test
model_id = "..."

# 4-bit quantization config to reduce the GPU memory footprint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

The Transformers library from HuggingFace is smart enough to download the model files automatically. A 7B model requires about 16.2 GB of GPU RAM, so I run the model in 4-bit precision with the help of the bitsandbytes library; after that, the required memory footprint is only about 5 GB.
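
If you want to verify the footprint on your own hardware, a quick check right after loading looks like this (get_memory_footprint is a standard Transformers helper):

print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.1f} GB")
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GB")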

Now, let's create a code snippet to test the models. As an example, I created a Python method to write a list of strings to a file:

python_code = '''
class Writer:
    def write_file(self, filename: str, data: List[str]):
        """ Write list of strings to a text file """
        with open(filename, 'w') as f_out:
            for line in data:
                f_out.write(f"{line}\n")
'''

To test the model's coding abilities, let's ask both models to make a "pytest" for it:

chat = [{
    "role": "user",
    "content": f"Write a pytest for this Python method:\n{python_code}. "
               f"Delete the created file at the end of the test."
}]

prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=1024)
result = tokenizer.decode(outputs[0])
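
The decoded string also contains the prompt itself; to keep only the model's answer, we can skip the prompt tokens before decoding. A small optional sketch:

answer = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(answer)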

As for results, CodeLlama 7B generated this code, and the generation took 19 seconds:

import pytest

class TestWriter:
    def test_write_file(self):
        writer = Writer()
        filename = 'test.txt'
        data = ['line1', 'line2', 'line3']
        writer.write_file(filename, data)
        with open(filename, 'r') as f:
            lines = f.readlines()
            assert lines == data
        os.remove(filename)

CodeGemma generated this code, and the process took 16 seconds:

import pytest

def test_write_file():
    """ Test the write_file method """
    filename = "test.txt"
    data = ["This is a test", "line 2", "line 3"]
    Writer().write_file(filename, data)

    with open(filename, "r") as f:
        assert f.read() == "This is a test\nline 2\nline 3\n"

    import os
    os.remove(filename)

Personally, I like the second version more. First, CodeGemma provided a docstring for the test method, which modern "linter" tools require. Second, the call Writer().write_file(...) looks more compact and readable than declaring a writer variable and using it later. Third, CodeGemma imported the "os" Python module, while CodeLlama "forgot" to do this.

At first glance, both code snippets look correct. Let's run the code by executing the pytest -v file.py command:

Pytest results, Image by author

Actually, I was wrong about the correctness of both tests: there is a bug in the first one. Funnily enough, the second test not only looks better, it also works, while the first one does not. The error is obvious from the screenshot; readers are welcome to figure out how to fix it on their own.

Initially, I was not going to test the CodeGemma 2B "Code Completion" model, but as a bonus for readers, let's do it! The loading of the model is the same; we only need to change the model ID:

model_id = "google/codegemma-2b"
model = AutoModelForCausalLM.from_pretrained(model_id, ...)

This model was trained for code completion. It does not need any description in English, and we only need to provide the source code:

# Prompt
python_code = '''
class Writer:
    def write_file(self, filename: str, data: List[str]):
        ...

import pytest

def test_write_file():
    """ Test the write_file method """
'''

prompt = f"""
<|fim_prefix|>{python_code}
<|fim_suffix|>
<|fim_middle|>
"""

# Run inference
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][prompt_len:]))

The result was surprisingly good, considering the small size of the model. It generated this output:

def test_write_file():
    """ Test the write_file method """
    writer = Writer()
    writer.write_file("test_file.txt", ["Hello", "World"])
    with open("test_file.txt", "r") as f:
        lines = f.readlines()
    assert lines == ["Hello
", "World
"]

As we can see, this code will not work "out of the box," but the logic looks correct. The required fix is to properly format the assert line:

assert lines == ["Hello\n", "World\n"]

After that, the "pytest" passed. The model also did not remove the file after the test, but I did not ask for that in the prompt. Last but not least, the execution time of the small model was only 3.3 seconds, about 5x faster than the bigger one.
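
For those who want to reproduce the timing comparison, a simple timer around the generate call is enough; a minimal sketch (the exact numbers will, of course, depend on the GPU):

import time

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
print(f"Generation took {time.perf_counter() - start:.1f} seconds")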

2. Running a Llama Server

We tested our models in Python; now let's run a local OpenAI-compatible server. I will be using llama-cpp-python for that. It is a nice and lightweight project; we can run any model we want with a single command:

# Code Gemma
python3 -m llama_cpp.server --model codegemma-7b-it-Q4_K_M.gguf --n_ctx 8192 --n_gpu_layers -1 --host 0.0.0.0 --port 8000

# Code Llama 7B
python3 -m llama_cpp.server --model codellama-7b-instruct.Q4_K_M.gguf --n_ctx 8192 --n_gpu_layers -1 --host 0.0.0.0 --port 8000

# Code Llama 13B
python3 -m llama_cpp.server --model codellama-13b-instruct.Q4_K_M.gguf --n_ctx 8192 --n_gpu_layers -1 --host 0.0.0.0 --port 8000

If there is not enough GPU RAM to load the model, the n_gpu_layers parameter can be reduced to offload only part of the layers to the GPU. We can also run a model on Apple Silicon or even on a CPU, but it will obviously be slower.
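
Before moving on to the apps, we can make a quick sanity check of the endpoint from Python using the standard openai client (version 1.x); this is a minimal sketch, assuming the server above is running on port 8000, and the model name here is just a placeholder:

from openai import OpenAI

# Point the standard OpenAI client at the local LlamaCpp server;
# the API key only needs to be a non-empty placeholder
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="12345678")

response = client.chat.completions.create(
    model="local-model",  # placeholder name; the server answers with the model it was started with
    messages=[{"role": "user", "content": "Write a Python one-liner to reverse a string"}],
)
print(response.choices[0].message.content)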

3. Apps

Currently, we have a local OpenAI-compatible server, and we are ready to test some apps!

3.1 AI Shell

AI Shell is an open-source app that can convert natural language prompts into console commands. The app is pretty popular; at the time of writing, the project has 3.6K stars on GitHub. AI Shell is written in TypeScript, and we can install it via the npm package manager (here, I also installed Node.js 20.13.0):

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
nvm install v20.13.0
npm install -g @builder.io/ai-shell

Before running the app, we need to configure the API endpoint:

ai config set OPENAI_KEY=12345678
ai config set OPENAI_API_ENDPOINT=http://127.0.0.1:8000/v1

Now, we can start the conversation with a model at any time by entering the "ai chat" command in the console:

AI Shell terminal output, Image by author

Another way of using the program is to describe the command we want to execute. For example, we can type something like "show files in the current folder":

AI Shell terminal output, Image by author

Alas, with a free 7B model, it did not work: the model was not able to produce the correct shell command. Also, the word "script" inside the prompt apparently confused the model, and it generated text about a movie script.

This issue could probably be fixed by tuning the prompt, but at the time of writing, the prompts were hardcoded in the TypeScript source and could not be easily configured. Nobody has responded to my feature suggestion on GitHub yet, but hopefully, it will be improved in the future.

3.2 ShellGPT

ShellGPT is another interesting open-source project, with 8.3K stars on GitHub at the time of writing. We can easily install the application using pip:

pip3 install shell-gpt

To use ShellGPT with a local model, we need to change an API endpoint in the ~/.config/shell_gpt/.sgptrc file:

API_BASE_URL=http://127.0.0.1:8000/v1
OPENAI_API_KEY=12345678

Then we can enter our requests directly in the terminal shell, almost the same way as in the previous app:

sgpt "Write a command to show local files"

Alas, the CodeGemma model did not work with ShellGPT: the LlamaCpp server returned a 500 error, "System role not supported." At first, I thought it was a LlamaCpp issue, but after checking the log, I saw that the model metadata contains these lines:

{% if messages[0]['role'] == 'system' %}
    {{ raise_exception('System role not supported') }}
{% endif %}

It is a pity that CodeGemma does not support the "system" role, which is widely used in the OpenAI API. Because of that, OpenAI-compatible apps that always send a system message cannot use CodeGemma, which is unfortunate because, as we saw before, the code generated by CodeGemma was pretty good.
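
The behavior is easy to reproduce directly in Python: applying CodeGemma's chat template to a conversation that starts with a "system" message raises the exception, while a user-only conversation is formatted fine. A minimal sketch (assuming access to the google/codegemma-7b-it repository on HuggingFace):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/codegemma-7b-it")

# A user-only conversation is formatted without problems
print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}], tokenize=False))

# A conversation that starts with a "system" message triggers the
# raise_exception(...) branch from the chat template shown above
try:
    tokenizer.apply_chat_template(
        [{"role": "system", "content": "You are a coding assistant"},
         {"role": "user", "content": "Hello"}], tokenize=False)
except Exception as err:
    print(err)  # System role not supported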

As for CodeLlama, ShellGPT works well:

Another useful feature is the ability to execute a command directly in the terminal shell by specifying the --shell flag:

There is still room for improvement; for example, a "Show size of the Documents folder" prompt returned a du -sh ~/Documents response. This is a correct bash command, but it came back wrapped in a backtick-fenced code block; ShellGPT was not able to extract the command from that string, and I got only a "command not found" error.
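
Until this is handled upstream, a small post-processing step on the client side could strip such fences before the command is executed; strip_code_fences below is a hypothetical helper for illustration, not part of ShellGPT:

import re

def strip_code_fences(response: str) -> str:
    """ Extract a shell command from a ```-fenced block, if present """
    match = re.search(r"```(?:\w+)?\s*(.*?)\s*```", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

print(strip_code_fences("```bash\ndu -sh ~/Documents\n```"))  # du -sh ~/Documents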

3.3 CodeGPT

Using bash commands can be useful, but how about actual coding assistance? We can get it with the help of the open-source CodeGPT plugin. First, I installed the plugin in my PyCharm IDE and configured it to use the LlamaCpp server:

CodeGPT Settings, Image by author

As an example, let's consider this Python class:

class ServerConnection:
    """ Server connection handling """

    def __init__(self):
        self.is_connected = False
        self.connection_time = -1
        self.uploads_total = 0
        self.reconnects_total = 0
        self.reconnect_threshold_sec = 64

I will ask the model to refactor the variables into a separate Python data class.

As for the results, CodeGemma was not able to do it; it returned the "System role not supported" error again. CodeLlama 7B was not able to complete the task either; it created a standard class instead of a data class. CodeLlama 13B performed the task well:

CodeGPT chat, Image by author
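
The refactoring in the screenshot looks roughly like the sketch below; this is my own approximation of the idea rather than the model's verbatim output, and the ConnectionState name is mine:

from dataclasses import dataclass

@dataclass
class ConnectionState:
    """ Connection counters and settings, extracted into a data class """
    is_connected: bool = False
    connection_time: int = -1
    uploads_total: int = 0
    reconnects_total: int = 0
    reconnect_threshold_sec: int = 64

class ServerConnection:
    """ Server connection handling """

    def __init__(self):
        self.state = ConnectionState()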

As a next step, I decided to ask for something more complex and entered a "create a UI Python application with a textfield and button" prompt. The CodeLlama 13B model generated this code:

import tkinter as tk

# Create the main window
root = tk.Tk()
root.title("My Application")

# Create a text field
text_field = tk.Entry(root)
text_field.pack()

# Create a button
button = tk.Button(root, text="Click Me!", command=lambda: print("You clicked the button!"))
button.pack()

# Start the main loop
root.mainloop()

The code is correct, but the application window was not visible – its size is not specified. I continued the chat and asked the model to change the title to "Hello World" and to set the window size to 320×200:

CodeGPT chat, Image by author

Now, the result was okay, and the requested app was working as expected:

Tkinter app, Image by author
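
For reference, the essential change boils down to setting the title and geometry on the root window; a minimal sketch of the relevant lines, combined with the earlier snippet:

import tkinter as tk

root = tk.Tk()
root.title("Hello World")   # requested window title
root.geometry("320x200")    # requested window size (width x height)
# ... the Entry and Button widgets from the previous snippet go here ...
root.mainloop()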

I must admit that the 13B model is not perfect. In theory, it has a large context window and should use the previous chat results, but when I asked the model to move the generated code into a class, it produced new code without setting the window size or title:

import tkinter as tk

class HelloWorld(tk.Frame):
    def __init__(self, master=None):
        super().__init__(master)
        self.pack()

        # Create a text field
        self.text_field = tk.Entry(self)
        self.text_field.pack()

        # Create a button
        self.button = tk.Button(self, text="Click Me!", command=lambda: print("Button clicked!"))
        self.button.pack()

if __name__ == "__main__":
    root = tk.Tk()
    app = HelloWorld(root)
    root.mainloop()

But generally speaking, the model created a correct class, and with a bit of copy-paste, it was easy to finish the job.

4. Disadvantages

From all these examples, we can see that the models work; they can generate both code and bash commands. But there are also some drawbacks and hiccups:

  • Using a local LLM instance requires a good graphics card. I have a 2.5-year-old GeForce RTX 3060 card with 8 GB of GPU RAM. In my Colab tests, I saw that 8 GB is enough to run a 7B model, but on a real desktop, there was not enough CUDA memory for that – the OS itself also needs some GPU memory to work. Practically, at least 16 GB of GPU RAM is required to run a 13B model, and 24 GB would be recommended to leave some headroom for future improvements. Does it make practical sense? Considering current GPU prices, I am not 100% sure – for $1,000–1,500 we can have an AI subscription for years.
  • The open-source apps are not perfect. In my tests, the LlamaCpp server sometimes crashed with a "segmentation fault," the CodeGPT plugin sometimes did not send any requests to the model and I had to restart PyCharm, and so on. It is open-source software with no guarantees of any kind, so I'm not complaining, but I must admit that for these AI tools, we are still in the "early adoption" stage.
  • It is also interesting to mention that running a large language model locally is an energy-consuming task. As a last test, I connected a power meter to my desktop PC. It turned out that during normal work, the PC consumes about 80 watts, but when an LLM request is running, the power consumption almost triples:
Power consumption during an AI model request, Image by author

Conclusion

In this article, I tested the ability of open language models to work as a coding assistant, and the results are interesting:

  • Even small 7B and 13B models can do some coding tasks like refactoring, writing unit tests, or generating small code templates. Obviously, these models are less capable than large ones like the 175B GPT-3.5, but using a local model does not require any subscription costs; it can also be faster and better from a privacy perspective.
  • On the other hand, running a local model requires high-end hardware, which can be not only expensive but also energy-consuming. At the time of writing this article, a high-end GPU may cost up to $1,500, which is hard to justify only for running local LLMs – for that money, we can have a subscription to a cloud service for a very long time.
  • The challenge of using AI tools lies not only in hardware but also in software. At least at the time of writing this post, the open-source ecosystem of AI software is not yet mature. I was surprised to find that there are 39,769 open 7B models on HuggingFace, but the number of open-source AI apps on GitHub is minuscule. The three described in this article are almost all I was able to find (if I missed something, please write in the comments below, and maybe I will make a second part of the review).

In general, using a local LLM for everyday coding tasks is doable, but as we can see, there are still many challenges, both in software and hardware. We also know that different companies are now working hard on more efficient AI chips and more efficient models. New models like Microsoft's Phi-3 are capable of running even on mobile hardware. How will this change the AI industry? Will the next generation of integrated graphics be cheap, silent, and CUDA-compatible? We don't know yet. Apparently, a lot of new AI-related hardware will be announced (Apple's M4 was already the first), and I hope, at least, that the new hardware will not be proprietary and left without drivers for open use.

Thanks for reading. If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn, where I periodically publish smaller posts that are not big enough for a full article. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.

Those who are interested in using language models and natural language processing are also welcome to read my other articles.

