A Weekend AI Project: Running LLaMA and Gemma AI Models on an Android Phone


Nowadays, "mobile AI" is a fast-growing trend. Smartphones are becoming more powerful, and large models are becoming more efficient. Some users may prefer to wait until phone manufacturers add new AI features, but can we run the latest AI models on our own? Indeed, we can, and the results are fun. In this article, I will show how to run the LLaMA and Gemma large language models on an Android phone, and we will see how it works. As usual in all my tests, all models will run locally, and no cloud APIs or payments are needed.

Let's get into it!

Termux

The first component of our test is Termux, a full-fledged Linux terminal made as an Android application. It is free, and it does not require root access; all Linux components run exclusively inside the Termux folder. Termux can be downloaded from Google Play, but at the time of writing this text, that version was pretty old, and the "pkg update" command in it did not work anymore. A newer version is available as an APK on the F-Droid website; it works well, and I had no problems with it.

When Termux is installed on the phone, we can run it and see a standard Linux command-line interface:

Termux window, Image by author

In theory, we can enter all commands directly on the phone, but typing on the tiny keyboard is inconvenient. A much better way is to install an SSH server; this can be done by using "pkg install":

pkg update
pkg upgrade
pkg install openssh

After that, we can start the SSH daemon in Termux by running the sshd command. We also need to get the user name and set the SSH password:

sshd
whoami
#> u0_a461
passwd
#> Enter new password
...

Now, we can connect to the phone with any SSH client:

ssh -p 8022 u0_a461@192.168.100.101

Here, 8022 is the default Termux SSH port, "u0_a461" is the user name we got from the "whoami" command, and "192.168.100.101" is the IP address of the phone on the local Wi-Fi network.

Once the connection is established, we are ready to test different LLMs. All commands presented below should be executed on the phone via SSH.

Llama.CPP

The first project we will test is Llama.CPP. I use it often because it's great for testing LLMs on different hardware. Llama.CPP can work almost everywhere – on a CPU, on CUDA, or on Apple Silicon. The original Llama.CPP is written in C++, but I will be using its Python bindings (llama-cpp-python), which are easier to use. Let's install the needed packages and libraries:

pkg install tur-repo libopenblas libandroid-execinfo ninja binutils
pkg install python3 python-numpy build-essential cmake clang git
pip3 install llama-cpp-python huggingface-hub

The huggingface-hub package is useful for downloading models. For our first test, I will be using a 4-bit quantized LLaMA-2 7B Chat model in the GGUF format:

huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
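
Alternatively, the same file can be downloaded directly from Python; here is a quick sketch using the hf_hub_download function from the huggingface-hub package we just installed:

from huggingface_hub import hf_hub_download

# Download the quantized GGUF file into the current folder
model_file = hf_hub_download(repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
                             filename="llama-2-7b-chat.Q4_K_M.gguf",
                             local_dir=".")
print("Saved to:", model_file)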

Now we are ready to use the model:

from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf",
            n_gpu_layers=0,
            n_ctx=1024,
            echo=True)

question = input(">>> Enter your question: ")
template = f"""[INST] <<SYS>>
You are an expert, please answer the question.
<</SYS>>
{question}[/INST]"""

stream = llm(template, stream=True, max_tokens=256, temperature=0.2)
for output in stream:
    print(output['choices'][0]['text'], end="")

Here, the n_gpu_layers parameter is set to 0 because there is no GPU support in Termux. Which smartphone is good enough to run a 7B model? Before running the code on the phone, I checked the results in Google Colab, and the resource consumption is pretty moderate:

Running a LLaMA 7B model, screenshot by author

As we can see, memory consumption is pretty small, and in theory, this model can run on almost every modern phone with 4–6 GB of RAM.
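
Readers who want to verify the memory consumption on their own device can do it with a few lines of Python. This is a minimal sketch; it assumes the psutil package ("pip3 install psutil"), which is not part of the setup above:

import psutil
from llama_cpp import Llama

def rss_gb():
    # Resident memory of the current Python process, in GB
    return psutil.Process().memory_info().rss / 1024**3

print(f"RSS before loading: {rss_gb():.2f} GB")
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=0, n_ctx=1024)
print(f"RSS after loading: {rss_gb():.2f} GB")

Because Llama.CPP memory-maps the model file, the reported value mostly reflects the pages that have actually been touched, so it grows a bit further during generation.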

But obviously, the next big question is speed. Honestly, I did not expect any good performance from a 7B model on a smartphone. But surprisingly, it works well enough. This is an unedited video showing the real speed:

LLaMA 7B Inference, Image by author
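
To get a rough tokens-per-second estimate as well, we can time the streaming loop; this small sketch reuses the llm and template objects from the listing above:

import time

start = time.perf_counter()
n_chunks = 0
stream = llm(template, stream=True, max_tokens=256, temperature=0.2)
for output in stream:
    print(output['choices'][0]['text'], end="")
    n_chunks += 1  # each streamed chunk roughly corresponds to one generated token
elapsed = time.perf_counter() - start
print(f"\n~{n_chunks / elapsed:.1f} tokens/s ({n_chunks} chunks in {elapsed:.1f} s)")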

Llama.CPP is memory-efficient; it does not load the full model into RAM. With enough free storage space, we can even run a 70B model (its file size is about 40 GB!). It works on a smartphone, but the inference time for the same answer was about 15 minutes. Naturally, there is no practical need for that, but the result is still fun. It's amazing that a 70B-parameter model can run fully offline on a pocket device (just for reference, the GPT-3 model has 175B parameters).
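
As a sketch, running the 70B model needs no code changes apart from the model path; the file name below is only an example of a 4-bit quantized 70B chat GGUF file, and memory mapping (enabled by default) is what keeps the RAM usage manageable:

from llama_cpp import Llama

# Example file name of a 4-bit quantized 70B chat model (~40 GB on disk)
llm_70b = Llama(model_path="llama-2-70b-chat.Q4_K_M.gguf",
                n_gpu_layers=0,   # still CPU-only in Termux
                n_ctx=1024,
                use_mmap=True)    # default: the file is memory-mapped, not fully loaded into RAM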

Now, let's test another kind of model.

Gemma.CPP

At the time of writing this text, Gemma is one of the newest Large Language Models; it was introduced by Google in February 2024. This model is interesting for several reasons:

  • According to Google, the Gemma model provides state-of-the-art performance for its size. For example, on the MMLU (Massive Multitask Language Understanding) benchmark, Gemma 7B outperforms LLaMA2 13B (64.3 vs. 54.8 score), and the Gemma 2B model is only slightly worse than LLaMA2 7B (42.3 vs. 45.3 score).
  • A 2B model should be significantly faster compared to the 7B one, which can be crucial on a mobile device.
  • Gemma's license allows commercial use, and the Gemma.CPP client (as its name suggests) is written in C++, which allows mobile developers to natively build it as a library in Android Studio or Xcode and use the model in smartphone applications.

So, let's see how it practically works! In the same way as for LLaMA, I will use a Termux console for that. First, we need to download and install Gemma.CPP:

pkg install wget cmake git clang python3
wget https://github.com/google/gemma.cpp/archive/refs/tags/v0.1.0.tar.gz
tar -xvzf v0.1.0.tar.gz
cd gemma.cpp-0.1.0
mkdir -p build
cd build
cmake ..
make -j4 gemma
cd ../../

To get access to the model, we need to log in to Kaggle, go to the model page, and accept the license agreement. It is possible to copy a model file to the smartphone manually, but the Kagglehub Python library provides a more convenient way to do this.

First, we need to install the packages:

pip3 install packaging kagglehub

Now, we can make a simple program to download the model and run this script on the smartphone:

import os, kagglehub

os.environ["KAGGLE_USERNAME"] = "..."
os.environ["KAGGLE_KEY"] = "..."

model_path = kagglehub.model_download('google/gemma/gemmaCpp/2b-it-sfp')
print("Path:", model_path)

The KAGGLE_USERNAME and KAGGLE_KEY values can be copied from a free access token, which can be generated in the Kaggle account settings.

When the download is finished, we will get the path, which looks like "/data/data/com.termux/…/gemmaCpp/2b-it-sfp/2". Now, let's make a shell script that will run the inference:

model_path=/data/data/com.termux/files/home/.cache/kagglehub/models/google/gemma/gemmaCpp/2b-it-sfp/2

./gemma.cpp-0.1.0/build/gemma --model 2b-it --tokenizer $model_path/tokenizer.spm --compressed_weights $model_path/2b-it-sfp.sbs

Finally, we are ready to run the script. The output looks like this:

Gemma 2B Inference, Image by author

Indeed, it's fast enough, and the result looks accurate. The RAM consumption for the 7B and 2B models is 9.9 GB and 4.1 GB, respectively.

Conclusion

In this article, we tested the Llama.CPP and Gemma.CPP open-source projects and were able to run 2B, 7B, and even 70B parameter models on an Android smartphone. As we can see, running modern LLMs on a smartphone is doable. At the time of writing this text, even budget $199 phones have about 8 GB of RAM and 256 GB of storage, so a 2B model can run on almost every modern phone, not only the top ones.

Why is it important? Large language models can easily run in the cloud, but cloud infrastructure costs money, and customers will, directly or indirectly, pay for every token generated by the cloud API. Running a model locally can reduce costs and will allow developers to make less expensive or even free apps.

From a development perspective, both the Llama.CPP and Gemma.CPP projects are written in C++ without external dependencies and can be natively compiled into Android or iOS applications (at the time of writing this text, I had already seen at least one such application available as an APK for Android and via TestFlight for iOS). A Termux console is also interesting for testing; it allows developers to use the same Python code on a smartphone and in Google Colab, which can be nice for fast prototyping.

Thanks for reading. If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.

Those who are interested in using language models and natural language processing are also welcome to read my other articles.

Tags: AI Android Large Language Models Programming Python
