Improve tabular data prediction with Large Language Model through OpenAI API

Author:Murphy | View: 29551 | Time: 2025-03-23 18:07:22

These days, large language models and the applications and tools built on them are all over the news and social media. The GitHub trending page features many repositories that rely heavily on large language models. We have seen their impressive ability to write marketing copy, summarize documents, compose music and generate code for software development.

Enterprises have an abundance of tabular data (one of the oldest and most ubiquitous data formats, represented as a table with rows and columns) accumulated internally and online. Can we apply large language models to tabular data within the traditional machine learning lifecycle to improve model performance and add business value?

In this article, we will explore the following topics with full Python implementation code:

Building generalized linear models and a tree-based model on the public Kaggle Heart Attack Analysis & Prediction Dataset
Prompt engineering to transform tabular data into text
Zero-shot classification with OpenAI API (GPT-3.5 model: text-davinci-003)
Boosting machine learning model performance with the OpenAI embedding API (text-embedding-ada-002)
Prediction explainability with OpenAI API – gpt-3.5-turbo

Dataset description

The data is available on Kaggle under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license, which waives copyright (you can copy, modify, distribute and perform the work, even for commercial purposes). Please refer to the link below:

Heart Attack Analysis & Prediction Dataset

It contains demographic features, medical conditions and the target. The columns are explained below:

age: age of the applicant
sex: sex of the applicant
cp: chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
trtbps: resting blood pressure (in mm Hg)
chol: cholesterol in mg/dl fetched via BMI sensor
fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
restecg: resting electrocardiographic results
thalachh: maximum heart rate achieved
exng: exercise-induced angina (1 = yes, 0 = no)
oldpeak: previous peak (ST depression induced by exercise relative to rest)
slp: slope of the peak exercise ST segment
caa: number of major vessels
thall: thal rate (mapped later in the code to fixed defect, normal or reversible defect)
output: target variable (0 = less chance of heart attack, 1 = more chance of heart attack)

Machine learning models

Binary classification models are developed to predict the likelihood of having a heart attack. This section will cover:

Pre-processing: missing value check, one-hot encoding, stratified train/test split, etc.
Building four models: three generalized linear models and one tree-based model (Logistic Regression, Ridge, Lasso and Random Forest)
Model evaluation with AUC

First, let's import the packages, load the data, pre-process it and create the train/test split.

import warnings
warnings.filterwarnings("ignore")

# Math and Vectors
import pandas as pd
import numpy as np

# Visualizations
import plotly.express as px

# ML
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import concurrent.futures

# Utils functions
from utils import prediction, compile_prompt, get_embedding, ml_models, create_auc_chart, gpt_reasoning
pd.set_option('display.max_columns', None)

# load data
df = pd.read_csv("./data/raw data/heart_attack_predicton_kaggle.csv")
df.shape

# check missing value
df.isna().sum()

# check outcome distribution
df['output'].value_counts()

# one-hot encoding
cat_cols = ['sex','exng','cp','fbs','restecg','slp','thall']
df_model = pd.get_dummies(df,columns=cat_cols)
df_model.shape

# train test stratified split
# Separate dependent and independent variables
X = df_model.drop(columns=['output'])
y = df_model['output'].tolist()
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=101,
                                            stratify=y, shuffle=True)

Now, let's build the model objects, fit the models, predict on the test set and calculate AUC.

## model function
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

def ml_models():
    # penalty=None (no regularization) replaces the string 'none',
    # which recent scikit-learn versions no longer accept
    lr = LogisticRegression(penalty=None, solver='saga', random_state=42, n_jobs=-1)
    # L1 (lasso) and L2 (ridge) regularized logistic regression
    lasso = LogisticRegression(penalty='l1', solver='saga', random_state=42, n_jobs=-1)
    ridge = LogisticRegression(penalty='l2', solver='saga', random_state=42, n_jobs=-1)
    rf = RandomForestClassifier(n_estimators=300, max_depth=5, min_samples_leaf=50,
                                max_features=0.3, random_state=42, n_jobs=-1)
    models = {'LR': lr, 'LASSO': lasso, 'RIDGE': ridge, 'RF': rf}
    return models

models = ml_models()
lr = models['LR']
lasso = models['LASSO'] 
ridge = models['RIDGE'] 
rf = models['RF'] 

pred_dict = {}
for k, m in models.items():
    print(k)
    m.fit(X_tr, y_tr)
    preds = m.predict_proba(X_val)[:,1]
    auc = roc_auc_score(y_val, preds)
    pred_dict[k] = preds
    print(k + ': ', auc)

Next, let's visualize and compare model performance (AUC).

In this visualization:

The tree-based model (Random Forest) performs best, with a much higher AUC.
The three generalized linear models perform at a similar level, with a lower AUC than the tree-based model, as expected.
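
For readers less familiar with the metric: AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below spells that out in plain Python. The function name pairwise_auc and the toy numbers are assumptions made for illustration; the article itself relies on roc_auc_score from scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and scores (made up for illustration, not taken from the dataset)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The double loop is quadratic in the number of rows, so it is only meant as an explanation of the metric, not as a replacement for the library function.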

Zero-shot classification with OpenAI API

We will perform zero-shot classification on the tabular data with the OpenAI API, using the text-davinci-003 model. Before we dive into the Python implementation, let's understand a bit more about zero-shot classification. The definition from Hugging Face is:

Zero Shot Classification is the task of predicting a class that wasn't seen by the model during training. This method, which leverages a pre-trained language model, can be thought of as an instance of transfer learning which generally refers to using a model trained for one task in a different application than what it was originally trained for. This is particularly useful for situations where the amount of labelled data is small.

In zero-shot classification, the model is given a prompt, that is, a sequence of text describing what we want it to do, without any examples of the expected behaviour. This section will cover:

Pre-processing of tabular data for Prompt Engineering
Prompting LLMs
Zero-shot prediction with GPT-3.5 API: text-davinci-003
Model evaluation with AUC

Pre-processing of tabular data

First, let's process the data before prompting:

df_gpt = df.copy()
df_gpt['sex'] = np.where(df_gpt['sex'] == 1, 'Male', 'Female')
df_gpt['cp'] = np.where(df_gpt['cp'] == 1, 'Typical angina', 
                       np.where(df_gpt['cp'] == 2, 'Atypical angina', 
                       np.where(df_gpt['cp'] == 3, 'Non-anginal pain', 'Asymptomatic')))
df_gpt['fbs'] = np.where(df_gpt['fbs'] == 1, 'Fasting blood sugar > 120 mg/dl', 'Fasting blood sugar <= 120 mg/dl')
df_gpt['restecg'] = np.where(df_gpt['restecg'] == 0, 'Normal', 
                       np.where(df_gpt['restecg'] == 1, 'Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)', 
                                    "Showing probable or definite left ventricular hypertrophy by Estes' criteria"))
df_gpt['exng'] = np.where(df_gpt['exng'] == 1, 'Exercise induced angina', 'Without exercise induced angina')
df_gpt['slp'] = np.where(df_gpt['slp'] == 0, 'The slope of the peak exercise ST segment is downsloping', 
                       np.where(df_gpt['slp'] == 1, 'The slope of the peak exercise ST segment is flat', 
                                    'The slope of the peak exercise ST segment is upsloping'))
df_gpt['thall'] = np.where(df_gpt['thall'] == 1, 'Thall is fixed defect', 
                       np.where(df_gpt['thall'] == 2, 'Thall is normal', 'Thall is reversable defect'))

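# test rows to a list of dicts, one per record
# X_val holds the one-hot encoded features, so the same rows are selected from the text version (df_gpt)
application_list = df_gpt.loc[X_val.index].drop(columns=['output']).to_dict(orient='records')
len(application_list)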

Prompting LLMs

Prompts are a powerful tool for interacting with large language models on a given task. A prompt is a user-provided input to which the model is meant to respond. Prompts can take various forms, e.g., text or an image.

In this article, the prompt consists of instructions, including the expected JSON output format, and the question itself, built as text from each record of the heart attack dataset.
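
The article's own prompt text is not reproduced here, so the following is only a sketch of how a record could be turned into such a prompt. The instruction wording, the function name build_prompt and the "output" key in the JSON format are assumptions chosen for illustration.

```python
# Hypothetical instruction text; the wording is an assumption, not the article's prompt.
INSTRUCTIONS = (
    "You are a medical expert. Evaluate the chance of a heart attack for the "
    "patient described below. Respond only with JSON in the format "
    '{"output": 0 or 1}, where 1 means more chance of heart attack.'
)


def build_prompt(record, instructions=INSTRUCTIONS):
    """Turn one row (a dict of column name -> value) into a text prompt."""
    facts = "\n".join(f"- {col}: {val}" for col, val in record.items())
    return f"{instructions}\n\nPatient record:\n{facts}\n\nAnswer:"


# A shortened example record, for illustration only
example_record = {"age": 51, "sex": "Male", "cp": "Atypical angina", "trtbps": 125}
example_prompt = build_prompt(example_record)
```

Keeping the instructions separate from the record makes it easy to reuse the same builder for every row and to swap the instruction text when an explanation is requested instead of a label.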

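Next, we will define the prompt and the API call function, which constructs the prompt and gets the response from the OpenAI GPT-3.5 API.

import json
import openai  # the calls below use the legacy (pre-1.0) SDK interface

def prediction_GPT3_5(data, explain = False):
    # prompt_logic is the prompt-building helper (not shown in this article);
    # passing the record `data` to it is an assumption, since the prompt has to describe the applicant
    prompt = prompt_logic(data, explain)
    print(prompt)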
    response = openai.Completion.create(
        model = 'text-davinci-003',
        prompt=prompt,
        max_tokens=64,
        n=1,
        stop=None,
        temperature=0.5,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0
    )

    try:
        output = response.choices[0].text.strip()
        output_dict = json.loads(output)
        return output_dict
    except (IndexError, ValueError):
        return None

def prediction(combined_data_argu):
    application_data, explain = combined_data_argu
    response = prediction_GPT3_5(application_data, explain)
    return response
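
The function above is written against the pre-1.0 openai Python SDK. With the 1.x SDK, requests go through a client object instead, and a chat model is the usual choice. The sketch below assumes that setup; the function name zero_shot_predict and the stub client are hypothetical, and the client is passed in as an argument so the logic can be tried without an API key.

```python
import json
from types import SimpleNamespace


def zero_shot_predict(client, prompt, model="gpt-3.5-turbo"):
    """Send one prompt through a 1.x-style client and parse the JSON reply.

    `client` is expected to behave like `openai.OpenAI()`.
    Returns the parsed dict, or None if the reply is not valid JSON.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
        temperature=0.5,
    )
    try:
        return json.loads(response.choices[0].message.content.strip())
    except (IndexError, ValueError, AttributeError):
        return None


# Offline stub that mimics the shape of the SDK response (for illustration only)
def _fake_create(**kwargs):
    message = SimpleNamespace(content=' {"output": 1} ')
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


stub_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
stub_result = zero_shot_predict(stub_client, "example prompt")
```

With the real SDK, the client would be created with `from openai import OpenAI` followed by `client = OpenAI()`, which reads the API key from the OPENAI_API_KEY environment variable.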

Get the API response – multithreading

A thread pool is used to run the API calls concurrently and speed them up. The code is:

### get prediction from GPT-3.5 model: text-davinci-003 - thread pool
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Combine the application data and the explain flag into a single iterable
    combined_data = zip(application_list, [False] * len(application_list))
    # Submit the prediction tasks to the executor
    results = executor.map(prediction, combined_data)

    # Collect the responses into a list
    responses = list(results)
responses_df = pd.DataFrame(responses)
responses_df.shape
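
One property of this pattern matters for the evaluation that follows: executor.map yields results in the order of its inputs, not in the order the calls finish, so the responses stay aligned with y_val. The sketch below illustrates this with a stand-in function (fake_prediction is hypothetical and only sleeps instead of calling the API).

```python
import concurrent.futures
import time


def fake_prediction(args):
    """Stand-in for the API call: items submitted later finish sooner."""
    index, explain = args
    time.sleep(0.05 * (3 - index))  # item 0 is the slowest, item 2 the fastest
    return {"output": index}


items = list(zip(range(3), [False] * 3))
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    ordered = list(executor.map(fake_prediction, items))

# Results come back in submission order even though item 2 finished first
returned_order = [r["output"] for r in ordered]
```

Threads are a reasonable fit here because the work is dominated by waiting on network responses rather than by computation.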

Zero-shot classification AUC

The AUC is 0.48 for zero-shot classification, which is slightly below 0.5 and means the predictions are no better than random guessing. It also suggests that the GPT-3.5 text-davinci-003 model has potentially not memorized this dataset, i.e., there is no evident leakage.

auc_gpt= roc_auc_score(y_val, responses_df['output'])
auc_gpt

Boost machine learning model performance with OpenAI embedding

Embeddings are exposed as an endpoint of large language model APIs (e.g., the OpenAI API) and make it easy to perform natural language and code tasks such as semantic search, clustering, topic modelling and classification. With prompt engineering, the tabular data is transformed into natural language text, which can then be used to generate embeddings. The embeddings have the potential to improve traditional machine learning model performance by capturing the meaning and context of that text, even with a small amount of labelled data. In short, it is a form of feature engineering in this context.

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

In this section, you will see:

How to get OpenAI embeddings through API call
Model performance comparison – with vs without embedding features

First, let's define the function to get the embeddings through the API and merge them with the raw dataset:

# define function to fetch the embedding
def get_embedding(text, model="text-embedding-ada-002"):
   # replace newlines with spaces before sending the text to the API
   text = text.replace("\n", " ")
   return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']

# API call and merge with raw data
df_gpt['ada_embedding'] = df_gpt.combined.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
df_gpt = df_gpt.join(pd.DataFrame(df_gpt['ada_embedding'].apply(pd.Series)))
df_gpt.drop(['combined', 'ada_embedding'], axis = 1, inplace = True)
df_gpt.columns = df_gpt.columns.tolist()[:14] + ['Embedding_' + str(i) for i in df_gpt.columns.tolist()[14:]]
df = pd.concat([df, df_gpt[[i for i in df_gpt.columns.tolist() if i.startswith('Embedding_')]]], axis=1)
df_gpt.shape
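
The snippet above reads a combined text column whose construction is not shown in the article. The sketch below shows one way such a column could be built and how a column of embedding lists can be expanded into Embedding_<i> features. The row_to_text format and the fake_embedding placeholder are assumptions for illustration; the placeholder stands in for the API call so the example runs offline.

```python
import pandas as pd


def row_to_text(row):
    """Join column names and values of one row into a single string."""
    return ", ".join(f"{col}: {val}" for col, val in row.items())


def fake_embedding(text, dim=4):
    """Placeholder for the embedding API call; returns a deterministic vector."""
    return [float(len(text) % (i + 2)) for i in range(dim)]


# Tiny made-up frame with the same kind of columns as the dataset
toy = pd.DataFrame({
    "age": [51, 63],
    "sex": ["Male", "Female"],
    "output": [0, 1],
})

# Build the text column from the features only, so the target is not leaked
features = toy.drop(columns=["output"])
toy["combined"] = features.apply(row_to_text, axis=1)

# Expand each embedding list into its own Embedding_<i> column
vectors = toy["combined"].apply(fake_embedding).tolist()
emb = pd.DataFrame(vectors, index=toy.index).add_prefix("Embedding_")
toy_with_emb = pd.concat([toy.drop(columns=["combined"]), emb], axis=1)
```

Building the embedding frame with an explicit index and prefix avoids relying on column positions when renaming, which makes the step less fragile if the number of raw columns changes.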

Similar to the pure machine learning models, we will also conduct a stratified split and fit the models:

# Separate dependent and independent variables
X = df.drop(axis=1,columns=['output'])
y = df['output'].tolist()

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=101,
                                                stratify=y,shuffle=True)
models = ml_models()
lr = models['LR']
lasso = models['LASSO'] 
ridge = models['RIDGE'] 
rf = models['RF'] 
pred_dict_gpt = {}
for k, m in models.items():
    print(k)
    m.fit(X_tr, y_tr)
    preds = m.predict_proba(X_val)[:,1]
    auc = roc_auc_score(y_val, preds)
    pred_dict_gpt[k + '_With_GPT_Embedding'] = preds
    print(k + '_With_GPT_Embedding' + ': ', auc)

Model performance comparison – with vs without embedding features

Together with the models without embedding features, we have 8 models in total. The ROC curves on the test set are shown below:

pred_dict_combine = dict(list(pred_dict.items()) + list(pred_dict_gpt.items()))
create_auc_chart(pred_dict_combine, y_val, 'Model AUC')

Generally, we observe:

Embedding features do not significantly improve the performance of the generalized linear models (Logistic Regression, Ridge and Lasso).
The Random Forest model with embedding features performs best and is slightly better than the Random Forest model without embedding features.

We see the potential for large language models to be integrated into the traditional model training process and improve the quality of the output. A natural follow-up question is: can large language models help explain the model's decision? Let's look at this in the next section.

Model explainability with OpenAI API – gpt-3.5-turbo

Model explainability is one of the key topics in machine learning applications, especially in areas such as insurance, healthcare, finance and law, where users need to understand how a model makes decisions at both the local and the global level. I have written an article on Deep Learning Model Interpretation Using SHAP if you'd like to know more about model interpretation for deep learning models.

This section covers:

Preparing input for the OpenAI API
Getting reasoning through gpt-3.5-turbo model

First, let's prepare the input for the API call.

application_data = application_list[0]
application_data

{'age': 51,
 'sex': 'Male',
 'cp': 'Atypical angina',
 'trtbps': 125,
 'chol': 245,
 'fbs': 'Fasting blood sugar > 120 mg/dl',
 'restecg': 'Normal',
 'thalachh': 166,
 'exng': 'Without exercise induced angina',
 'oldpeak': 2.4,
 'slp': 'The slope of the peak exercise ST segment is flat',
 'caa': 0,
 'thall': 'Thall is normal'}

Next, let's get the reasoning by calling the gpt-3.5-turbo API.

# `prompt` is the user prompt built from application_data (its construction is not shown in this article)
message_objects = [
    {"role": "system", "content": '''You are a medical expert / underwriter in a global insurance company. Your job is to evaluate the chance of having heart attack. Please encode your response as json in the following format
    {
        "decision": "",
    }'''},
    {"role": "user", "content": prompt},
]

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message_objects,
    max_tokens=1000,  # Adjust the max_tokens as per your desired response length
    stop=None,  # Set custom stop conditions if required
)

# Extract the response message content
response_content = completion.choices[0].message["content"]

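The response is quite impressive: the large language model produces a structured, factor-by-factor explanation. It is not fully reliable, though. For example, it states that the applicant's fasting blood sugar is not above 120 mg/dl, whereas the input record says it is, so such explanations should be reviewed before they are used.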

{ "decision": "less chance of heart attack", "reasoning": "Based on the information provided, the applicant has several factors that indicate a lower chance of having a heart attack. Firstly, the age of the applicant is 51, which is not considered young but also not in the high-risk range for heart attacks. Secondly, the applicant is male. While men generally have a higher risk than women for heart attacks, it is not the sole determining factor. Thirdly, the chest pain type reported by the applicant is atypical angina. Atypical angina is characterized by chest pain that is less predictable and may have different patterns compared to typical angina. This can indicate a lower risk of heart attack compared to typical angina. Fourthly, the resting blood pressure of the applicant is 125 mm Hg. This falls within the normal range and does not indicate hypertension, which is a risk factor for heart attacks. Fifthly, the cholesterol level of the applicant is 245 mg/dl. While this is higher than the recommended level, it is not extremely high, and the applicant's BMI is not provided, so we cannot determine if the cholesterol level is high due to obesity. Sixthly, the applicant does not have fasting blood sugar level above 120 mg/dl, which indicates a lower risk of diabetes, another risk factor for heart attacks. Seventhly, the resting electrocardiographic results of the applicant are normal, which indicates normal heart function and reduces the risk of heart attack. Eighthly, the maximum heart rate achieved by the applicant is 166, which is a good sign as it indicates a healthier cardiovascular system. Ninthly, the applicant does not experience exercise-induced angina, which is another positive factor. Tenthly, the ST depression induced by exercise relative to rest is 2.4, which is within the normal range and does not indicate significant ischemia. Eleventhly, the slope of the peak exercise ST segment is flat, which could be a normal finding or related to the atypical angina reported by the applicant. Lastly, the applicant has no major vessels and a normal Thall, indicating a lower risk of coronary artery disease. Considering all these factors, it is likely that the applicant has a lower chance of having a heart attack. However, it is important to note that this evaluation is based solely on the provided information and further medical assessment may be necessary to make a definitive determination." }
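
Because the reply is JSON-formatted text, the decision and the reasoning can be pulled out as separate fields. The sketch below shows one tolerant way to do that; the function name parse_decision and the sample reply are assumptions for illustration. It looks for the outermost braces so that any extra text around the JSON is ignored, and it passes strict=False so that line breaks inside the reasoning string do not cause a parse error.

```python
import json


def parse_decision(response_content):
    """Extract the JSON object from a model reply, tolerating surrounding text."""
    start = response_content.find("{")
    end = response_content.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        # strict=False allows control characters such as newlines inside strings
        return json.loads(response_content[start:end + 1], strict=False)
    except ValueError:
        return None


# Shortened, made-up reply in the same shape as the response shown above
sample_reply = 'Sure. { "decision": "less chance of heart attack", "reasoning": "Example text." }'
parsed = parse_decision(sample_reply)
```

Returning None on failure mirrors the error handling used for the zero-shot predictions earlier, so callers can treat unparseable replies in one place.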

Summary

Large language models are a powerful tool for solving a wide range of use cases across industries. Creating LLM applications is becoming easier and increasingly affordable, and LLMs can add real business value to the enterprise.

Reference

What is Zero-Shot Classification? – Hugging Face

An Introduction to Large Language Models: Prompt Engineering and P-Tuning | NVIDIA Technical Blog

Deep Learning Model Interpretation Using SHAP
