A Weekend AI Project: Making a Visual Assistant for People with Vision Impairments

Modern large multimodal models (LMMs) can process not only text but also other types of data, such as images. Indeed, "a picture is worth a thousand words," and this capability can be crucial when interacting with the real world. In this "weekend project," I will combine a free LLaVA (Large Language-and-Vision Assistant) model, a camera, and a speech synthesizer to make an AI assistant that can help people with vision impairments. As in the previous parts, all components will run fully offline, without any cloud costs.
Without further ado, let's get into it!
Components
In this project, I will use several components:
- A LLaVA model, which combines a large language model and a vision encoder with the help of a special projection matrix. This allows the model to understand not only text but also image prompts. I will be using the LlamaCpp library to run the model (despite its name, it can run not only LLaMA but also LLaVA models).
- The Streamlit Python library, which allows us to make an interactive UI. Using the camera, we can take an image and ask the LMM different questions about it (for example, we can ask the model to describe the image).
- A TTS (text-to-speech) model, which will convert the LMM's answer into speech so that a person with a vision impairment can listen to it. For the conversion, I will use an MMS-TTS (Massively Multilingual Speech) model made by Facebook.
As promised, all listed components are free to use, don't need any cloud API, and can work fully offline. From a hardware perspective, the model can run on any Windows or Linux laptop or tablet (an 8 GB GPU is recommended but not mandatory), and the UI can work in any browser, even on a smartphone.
Let's get started.
LLaVA
LLaVA (Large Language-and-Vision Assistant) is an open-source large multimodal model that combines a vision encoder and an LLM for visual and language understanding. As mentioned before, I'll use the LlamaCpp library to load the model. This library is great for running language models on different hardware; it supports quantization and can run almost everywhere: on a CPU, on CUDA, and on Apple Silicon.
As a first step, let's download the model from HuggingFace:
# 7B model
huggingface-cli download jartine/llava-v1.5-7B-GGUF llava-v1.5-7b-Q4_K.gguf --local-dir . --local-dir-use-symlinks False
huggingface-cli download jartine/llava-v1.5-7B-GGUF llava-v1.5-7b-mmproj-f16.gguf --local-dir . --local-dir-use-symlinks False
# 13B model
huggingface-cli download PsiPi/liuhaotian_llava-v1.5-13b-GGUF llava-v1.5-13b-Q5_K_M.gguf --local-dir . --local-dir-use-symlinks False
huggingface-cli download PsiPi/liuhaotian_llava-v1.5-13b-GGUF mmproj-model-f16.gguf --local-dir . --local-dir-use-symlinks False
Here, we can choose the 7B or 13B model depending on our hardware capabilities. At the time of this writing, there are about 350 "LLaVA" models on HuggingFace, so the choice is wide, and readers are welcome to test other models on their own.
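Alternatively, the same files can be downloaded directly from Python with the huggingface_hub library; here is a sketch equivalent to the 7B commands above:
from huggingface_hub import hf_hub_download

# Download the 7B model weights and the CLIP projection file into the current folder
hf_hub_download(repo_id="jartine/llava-v1.5-7B-GGUF",
                filename="llava-v1.5-7b-Q4_K.gguf", local_dir=".")
hf_hub_download(repo_id="jartine/llava-v1.5-7B-GGUF",
                filename="llava-v1.5-7b-mmproj-f16.gguf", local_dir=".")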
When the download is ready, let's load the model and its projection matrix:
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

model_file = "llava-v1.5-7b-Q4_K.gguf"
model_mmproj_file = "llava-v1.5-7b-mmproj-f16.gguf"

chat_handler = Llava15ChatHandler(clip_model_path=model_mmproj_file)
model = Llama(
    model_path=model_file,
    chat_handler=chat_handler,
    n_ctx=2048,
    n_gpu_layers=-1,  # Set to 0 if you don't have a GPU
    verbose=True,
    logits_all=True,
)
I will also create a helper to convert an image into the base64 format:
import base64
from io import BytesIO
from PIL import Image


def image_b64encode(img: Image) -> str:
    """ Convert an image to the base64 format """
    buffered = BytesIO()
    img.save(buffered, format="JPEG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")
Now we are ready to ask our model any question about any image:
import sys
from typing import Any


def model_inference(model: Any, request: str, image: Image) -> str:
    """ Ask the model a question about an image """
    image_b64 = image_b64encode(image)
    out_stream = model.create_chat_completion(
        messages=[
            {
                "role": "system",
                "content": "You are an assistant who perfectly describes images."
            },
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                    {"type": "text",
                     "text": request}
                ]
            }
        ],
        stream=True,
        temperature=0.2
    )
    # Get characters from the stream
    output = ""
    for r in out_stream:
        data = r["choices"][0]["delta"]
        if "content" in data:
            print(data["content"], end="")
            sys.stdout.flush()
            output += data["content"]
    return output
Here, I also use streaming, which is more convenient for debugging; we can see the output on the console. As a test, let's ask the model to describe this image:

To do this, we need only two lines of code:
img = Image.open('cat.jpg')
model_inference(model, "Please describe the image", img)
#> The image features a gray cat wearing sunglasses, giving it an adorable
#> and stylish appearance. The cat is sitting on a table or countertop,
#> with its eyes peering through the yellow lenses of the sunglasses. The
#> scene appears to be indoors, possibly in a living room or kitchen area.
The result is not bad for a free, open-source model! This part works, and we can go further.
Text-to-speech (TTS)
After receiving the text from the model, we want to play it as audio. To generate audio from text, I will use the MMS-TTS (Massively Multilingual Speech) project from Facebook. To use this model with the HuggingFace Transformers library, we need only a few lines of code:
import torch
from transformers import VitsModel, VitsTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")

text = "Hello World"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    data = model(**inputs).waveform.cpu().numpy()
This code generates raw audio data with a 16 kHz sampling rate. To play it in a web browser, we need to convert this data into the WAV format:
import numpy as np
import scipy.io.wavfile
from io import BytesIO

sample_rate = model.config.sampling_rate  # 16,000 Hz for MMS-TTS

buffer = BytesIO()
data_int16 = (data * np.iinfo(np.int16).max).astype(np.int16)
scipy.io.wavfile.write(buffer, rate=sample_rate, data=data_int16.squeeze())
data_wav = buffer.getbuffer().tobytes()
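Before wiring this into the web app, it is easy to verify the result locally by writing the WAV bytes to a file and playing it with any audio player (this reuses the variables from the snippet above; the file name is arbitrary):
# Quick local check: save the generated WAV bytes to a file
with open("tts_test.wav", "wb") as f:
    f.write(data_wav)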
Practically, this model works well for our task. I also tried the Bark TTS model from Suno. It provides better quality (a 24 kHz vs. 16 kHz sample rate), but the model file is 10x bigger (1.5 GB instead of 150 MB), and the audio generation takes several times longer.
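In the Streamlit app below, this TTS code is used through a small tts object with a generate_audio method. Here is a minimal sketch of such a wrapper (the TextToSpeech class name is my own choice, used only for illustration):
import numpy as np
import scipy.io.wavfile
import torch
from io import BytesIO
from transformers import VitsModel, VitsTokenizer


class TextToSpeech:
    """ Minimal MMS-TTS wrapper: text in, WAV bytes out """

    def __init__(self, model_name: str = "facebook/mms-tts-eng"):
        self.model = VitsModel.from_pretrained(model_name)
        self.tokenizer = VitsTokenizer.from_pretrained(model_name)

    def generate_audio(self, text: str) -> bytes:
        """ Convert a text string into WAV-encoded bytes """
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            data = self.model(**inputs).waveform.cpu().numpy()
        # Convert the float waveform to 16-bit PCM and wrap it into a WAV container
        data_int16 = (data * np.iinfo(np.int16).max).astype(np.int16)
        buffer = BytesIO()
        scipy.io.wavfile.write(buffer,
                               rate=self.model.config.sampling_rate,
                               data=data_int16.squeeze())
        return buffer.getbuffer().tobytes()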
Streamlit
Streamlit is a Python library that allows us to easily create an interactive web app. At the beginning, let's create two helper methods to load and cache the model:
import streamlit as st

# LLaVA model
model_mmproj_file = "llava-v1.5-7b-mmproj-f16.gguf"
model_file = "llava-v1.5-7b-Q4_K.gguf"


@st.cache_resource
def load_chat_handler():
    """ Load the LLaVA chat handler """
    return Llava15ChatHandler(clip_model_path=model_mmproj_file)


@st.cache_resource
def load_model():
    """ Load the model """
    chat_handler = load_chat_handler()
    return Llama(
        model_path=model_file,
        chat_handler=chat_handler,
        n_ctx=2048,
        n_gpu_layers=-1,  # Set to 0 if you don't have a GPU
        verbose=True,
        logits_all=True,
    )
Here, the cache_resource decorator is required to avoid reloading the models whenever the user presses the "Refresh" button in the browser. Streamlit already has camera support, so we don't need to write a lot of code for it. I only need to create helper methods to generate text and audio from an image:
def st_describe(model: Any, tts: Any, prompt: str, image: Image):
    """ Describe an image with a prompt in the browser """
    with st.spinner('Describing the image...'):
        response = model_inference(model, prompt, image)
    st_generate_audio(tts, response)


def st_generate_audio(tts: Any, text: str):
    """ Generate and play the audio """
    with st.spinner('Generating the audio...'):
        wav_data = tts.generate_audio(text)
    st_autoplay(wav_data)


def st_autoplay(wav: bytes):
    """ Create an audio control in the browser """
    b64 = base64.b64encode(wav).decode()
    md = f"""
        <audio controls autoplay="true">
            <source src="data:audio/wav;base64,{b64}" type="audio/wav">
        </audio>
    """
    st.markdown(md, unsafe_allow_html=True)
Here, I used an HTML "audio" control with autoplay enabled, so the sound will start automatically when the generation is finished.
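The main() function below also calls a load_tts_model helper, cached in the same way as the LLaVA loaders; a minimal sketch, assuming the TextToSpeech wrapper from the TTS section:
@st.cache_resource
def load_tts_model():
    """ Load and cache the MMS-TTS wrapper """
    return TextToSpeech()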
After that, the program logic can be written in less than 20 lines of code:
def main():
    """ Main app """
    st.title('LLAVA AI Assistant')
    with st.spinner('Loading the models, please wait'):
        model = load_model()
        tts = load_tts_model()

    img_file_buffer = st.camera_input("Take a picture")
    if img_file_buffer:
        cam_image = Image.open(img_file_buffer)
        if st.button('Describe the image'):
            st_describe(model, tts, "Please describe the image.", cam_image)
        if st.button('Read the label'):
            st_describe(model, tts,
                        "Read the text on the image. If there is no text, write that the text cannot be found.",
                        cam_image)
Here, I added two buttons to the user interface: the first button will describe the image, and the second one will read the text on the image (as a reminder, this prototype is intended for people with visual impairments). A LLaVA model can do much more; for example, we can ask what kind of meals we can make from the groceries in a photo. Readers are welcome to do more experiments on their own.
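As an illustration, such an extra action could be added to main() like this (a sketch only; the prompt wording is my own):
        # A possible extra action: suggest meals from the groceries visible in the photo
        if st.button('Suggest a meal'):
            st_describe(model, tts,
                        "What meals can I cook from the groceries in this image?",
                        cam_image)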
As a final result, we get a web application that can run on a desktop or even on a smartphone:

As a side note, for testing in Google Chrome on a local network, the server address must be added to the "Insecure origins treated as secure" list; otherwise, the web camera will not work without HTTPS.
Conclusion
In this article, we made a prototype of a system that can help people with visual impairments. With a LLaVA model, a camera, and a speech synthesizer, we can analyze images of the real world and ask different questions about their content. The model and the web server can run fully offline and for free on a PC or laptop. They can also be hosted in the cloud, in which case a smartphone may be a more portable client. A custom-made device with speech recognition and a push-to-talk button could be even better: fully blind people could interact with the model using speech and get answers from the speech synthesizer.
Obviously, running a cloud GPU instance is not free, and personally, I am not a fan of paid APIs and subscriptions. However, the LLaVA model is computationally expensive; it requires at least an 8 GB GPU to run smoothly. In that case, a low-cost device plus a cloud API may be cheaper than high-end hardware capable of running the same model locally. So, there are many options to think about, both from a tech and a cost-optimization perspective. On the other hand, modern smartphones will apparently be able to run ML models soon: there is already a PyTorch build for Android and iOS, though I am not sure whether it can handle really large models. Anyway, progress in this area is fast, and I hope that AI solutions can make people's lives better.
Thanks for reading. If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.
Those who are interested in using language models and natural language processing are also welcome to read other articles:
- A Weekend AI Project (Part 2): Using Speech Recognition, PTT, and a Large Action Model on a Raspberry Pi
- A Weekend AI Project (Part 1): Running Speech Recognition and a LLaMA-2 GPT on a Raspberry Pi
- LLMs for Everyone: Running LangChain and a MistralAI 7B Model in Google Colab
- LLMs for Everyone: Running the LLaMA-13B model and LangChain in Google Colab
- LLMs for Everyone: Running the HuggingFace Text Generation Inference in Google Colab
- 16, 8, and 4-bit Floating Point Formats – How Does it Work?