Illuminating Insights: GPT Extracts Meaning from Charts and Tables


Integrating visual inputs like images alongside text and speech into large language models (LLMs) is considered an important new direction in AI research by many experts in the field. By augmenting these models to handle multiple modes of data beyond just language, there is potential to significantly broaden the range of applications they can be used for and to enhance their performance on existing NLP tasks.

The promise of multimodal AI spans from more engaging user experiences like conversational agents that can see their surroundings and refer to objects around them, to robots that can fluidly translate commands into physical actions using combined knowledge of language and vision. By uniting historically separate areas of AI around a unified model architecture, multimodality may accelerate progress in tasks relying on multiple skills like visual question answering or image captioning. The synergies between learning algorithms, data types, and model designs across fields could lead to rapid advancement.

Many companies have already embraced multimodality in various forms: OpenAI, Anthropic, and Google (Bard and Gemini) all let you upload your own images or text and chat about them.

In this article, I hope to demonstrate a straightforward yet powerful application of large language models with Computer Vision in finance. Equity researchers and investment banking analysts may find this especially useful, as you likely spend considerable time reading reports and statements containing various tables and graphs. Reading lengthy tables and graphs and interpreting them correctly requires a great amount of time, domain knowledge, and sustained focus to avoid mistakes. More tediously, analysts occasionally need to manually enter tabular data from PDFs simply to create new charts. An automated solution could ease these pains by extracting and interpreting the key information without tiring or losing focus the way a human reader does.

In fact, by combining NLP with computer vision, we can create an assistant to handle many repetitive analytical tasks, freeing analysts to focus on higher-level strategy and decision making.

In recent years, there have been many advances in using Optical Character Recognition (OCR) and Visual Document Understanding (image-to-text) to extract text from image or PDF data. However, due to the nature of currently available training data, the existing methods still struggle with the complex layouts and formatting found in many financial statements, research reports, and regulatory filings.

GPT-4V(ision) for tables and graphs

Back in Sept 2023, OpenAI released GPT-4 Vision. According to OpenAI:

GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user.

GPT-4V's visual skills come from GPT-4, so both models were trained in a similar way. First, researchers fed the system huge amounts of text to teach it the fundamentals of language; its goal was to predict the next word in a document. Then came the fine-tuning stage, using an approach called reinforcement learning from human feedback, or RLHF for short. This involves tuning the model further based on positive reactions from human trainers so that it produces outputs we find truly helpful.

In this article, I'm going to build a Streamlit application where the user can upload an image and ask various questions about it. The images I am going to use are screenshots of a financial PDF document; in fact, the document is a publicly available Fund Fact Sheet.

There are two main parts to this code, first is a function to encode the image from a given file path:

import base64

# Function to encode the image from a file path
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

You need this function because the model expects your input image to be base64-encoded. The second main part of the code is the function that sends your request to the OpenAI API:

import requests

# Function to send the request to OpenAI API
def get_image_analysis(api_key, base64_image, question):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                ]
            }
        ],
        "max_tokens": 300
    }

    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
    return response.json()['choices'][0]['message']['content']

Here we set the model name to gpt-4-vision-preview. As you can see, this is quite different from the usual text-to-text OpenAI API calls: we define a JSON object called payload that contains both your text prompt and your image data.

You can expand the get_image_analysis function to send multiple images, or control how the model processes them via the detail parameter (see OpenAI's vision documentation for more).
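For example, here is a minimal sketch of what the payload could look like with two images and an explicit detail level (base64_image_1 and base64_image_2 are placeholders for your own encoded images):

# Sketch: payload with two images and an explicit `detail` level
# ("low", "high" or "auto"; higher detail uses more tokens).
payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image_1}", "detail": "high"}},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image_2}", "detail": "low"}}
            ]
        }
    ],
    "max_tokens": 300
}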

The rest of the code is mainly the Streamlit app itself, where we let users upload an image and interact with it by asking questions.

Full code (also available on GitHub):

import streamlit as st
import os
import requests
import base64
from PIL import Image

# Function to encode the image from a file path
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Function to save the uploaded file
def save_uploaded_file(directory, file):
    if not os.path.exists(directory):
        os.makedirs(directory)
    file_path = os.path.join(directory, file.name)
    with open(file_path, "wb") as f:
        f.write(file.getbuffer())
    return file_path

# Function to send the request to OpenAI API
def get_image_analysis(api_key, base64_image, question):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                ]
            }
        ],
        "max_tokens": 300
    }

    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
    return response.json()['choices'][0]['message']['content']

def main():
    st.title("Image Analysis Application")

    uploaded_file = st.file_uploader("Choose an image...", type=["jpg", "jpeg", "png"], key="file_uploader")

    if uploaded_file is not None:

        file_path = save_uploaded_file('data', uploaded_file)

        # Encode the uploaded image
        base64_image = encode_image(file_path)

        # Session state to store the base64 encoded image
        if 'base64_image' not in st.session_state or st.session_state['base64_image'] != base64_image:
            st.session_state['base64_image'] = base64_image

        image = Image.open(uploaded_file)
        st.image(image, caption='Uploaded Image.', use_column_width=True)

    question = st.text_input("Enter your question about the image:", key="question_input")

    submit_button = st.button("Submit Question")

    api_key = os.getenv("OPENAI_API_KEY")

    if submit_button and question and 'base64_image' in st.session_state and api_key:
        # Get the analysis from OpenAI's API
        response = get_image_analysis(api_key, st.session_state['base64_image'], question)
        st.write(response)
    elif submit_button and not api_key:
        st.error("API key not found. Please set your OpenAI API key.")

if __name__ == "__main__":
    main()
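To try the app locally, you could save the script (as, say, app.py), set your OpenAI key in the OPENAI_API_KEY environment variable, and start it with streamlit run app.py.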

Output and summary

Now let us look at a few examples:

Image is generated by the author. The graph is from the publicly available UBS Fund Fact Sheet.

In this example, the question asks about the peak of the fund's performance. We can see that the model identifies the peak correctly. It also understood from the plot's legend that the dotted line represents the index performance. Distinguishing dashed and dotted lines is generally difficult in computer vision, but given a good-quality screenshot (with enough detail), GPT-4V handles the task easily.

Another example:

Image is generated by the author. The material is from the publicly available UBS Fund Fact Sheet.

In this example, I have tried to examine how good the model is at: 1) locating the relevant table among other data, 2) extracting the relevant part of that table, and 3) performing some basic mathematical operations.

As demonstrated, the model successfully fulfilled all three requirements for this task, which is no small feat given how complex this would traditionally be. Manually, an analyst would have struggled to extract the two-column table locked within a PDF, even using optical character recognition (OCR) tools. Additional coding would be needed to parse the figures into a structured dataframe amenable to aggregation. This could consume substantial time before answering the original question. Here, however, one achieves the desired result with only a prompt. Avoiding the cumbersome workflow of deciphering images, scraping data, wrangling spreadsheets, and writing scripts unlocks tremendous efficiency.
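As an illustration of taking this one step further (not part of the app above), one could also ask the model to return the table in a machine-readable format and load it straight into pandas; a minimal sketch, reusing get_image_analysis and assuming the model follows the formatting instruction:

import io
import pandas as pd

# Ask for the table as CSV so the answer can be parsed programmatically.
# The model may not always follow the format exactly, so treat this as best effort.
question = ("Extract the allocation table from this image and return it as CSV "
            "with a header row and no extra commentary.")
csv_text = get_image_analysis(api_key, base64_image, question)

df = pd.read_csv(io.StringIO(csv_text))
print(df)
# Assuming the second column parses as numeric, we can aggregate it directly:
print("Total:", df.iloc[:, 1].sum())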

Lastly:

Image is generated by the author. The material is from the publicly available UBS Fund Fact Sheet.

Sorting algorithms systematically reorder the elements of a list or array based on a specified ordering rule. However, unlike traditional code, LLMs like GPT do not have predefined sorting routines hardcoded within.

Instead, GPT is trained to predict the next word in a sequence given prior context. With sufficient data and model capacity, the ability to sort emerges from learning textual patterns.

The example above demonstrates this – GPT correctly sorts two columns in a table extracted from a PDF screenshot, a non-trivial feat requiring optical character recognition, data extraction, and manipulation skills. Even in Excel, multi-column sorting requires some expertise. But by simply providing the goal in a prompt, GPT handles these complex steps automatically behind the scenes.

Rather than following rigid, step-by-step instructions like traditional algorithms, LLMs like GPT develop sorting capabilities through recognizing relationships in text during training. This allows them to absorb a variety of abilities from their diverse exposure, as opposed to being limited by predefined programming.

Why is this a big deal?

By harnessing this flexibility for specialised tasks like the ones shown here, prompts can unlock efficient problem solving that would otherwise demand significant manual effort and technical knowledge.
