Large Models Meet Big Data: Spark and LLMs in Harmony

Author:Murphy  |  View: 29949  |  Time: 2025-03-22 23:49:24

DATA ENGINEERING & GENERATIVE AI

The image is generated by Midjourney.

Generative AI, including Large Language Models (LLMs), is revolutionizing many aspects of human life. Over the past five years, Generative AI has evolved from a research project into a real-life application for many people. As a data engineer interested in Generative AI, I have always asked myself: what does this technology bring to my work and to Data Engineering applications? There are some common applications of Gen AI and LLMs for engineers, such as copilot-style coding, assisting with documentation, and so on. Here, however, I am evaluating some of the more specialized uses of Gen AI and LLMs for data engineering. If you are interested in this topic, please read this article and follow me on Medium and LinkedIn to get more articles about other use cases.

LLMs: Powerful Tools for Transformations

It is nothing new that data engineers love structured and abstracted data. But the world is full of unstructured and disorganized data that requires the attention of data engineers. Transformations on unstructured data are always complicated and sometimes impossible with traditional tools. Historically, one of the most challenging kinds of unstructured data has been text (e.g. comments, reviews, conversations). Simple transformations on text were never a big deal, but more complicated transformations can extract more information from text and yield richer data sets.

Examples of complicated text transformations include extracting names and objects from a text, sentiment analysis on a review or a comment, masking important information (e.g. private data, user data) in stored texts, translating from one language to a standard language, text summarization, and so on. The good news is that nowadays LLMs can do all of these transformations. Therefore, I believe one of the hundreds of applications of LLMs in data engineering is to act as transform functions for complicated data such as text.
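To make the idea of an LLM as a transform function a little more concrete, here is a minimal sketch in which each transformation is just a different prompt wrapped around the same input text. The sentiment template is the one used later in this article; every other template, the `PROMPT_TEMPLATES` name, and the `build_prompt` helper are hypothetical wording for illustration, and nothing here claims that a small model handles each task well.

```python
# Hypothetical prompt templates, one per transformation.
# Only the "sentiment" wording comes from this article; the rest are assumptions.
PROMPT_TEMPLATES = {
    "sentiment": "sentiment of the text: {text}",
    "summarize": "summarize: {text}",
    "translate_to_english": "translate to English: {text}",
    "extract_names": "list the names of people and places in the text: {text}",
}


def build_prompt(task, text):
    """Return the prompt for `task` applied to `text`."""
    try:
        template = PROMPT_TEMPLATES[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
    # `text` is passed as a value, so braces inside it are left untouched.
    return template.format(text=text)


if __name__ == "__main__":
    print(build_prompt("sentiment", "LLMs could be very slow."))
```

With this shape, switching a pipeline from one transformation to another would mostly mean switching the task name, while the model call around it stays the same.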

In this article, I will show this ability of LLMs via Apache Spark, a powerful distributed data processing system. More specifically, I am going to use a small LLM (flan-t5-small) from Hugging Face as an Apache Spark UDF and apply a specific transformation (sentiment analysis) to a sample data set.

Set up the project

First, we need a system with Apache Spark. You can either install Apache Spark on your local system or use a service such as AWS EMR. I used AWS EMR and set up a small cluster for testing. There are many articles on how to install Apache Spark on a local machine or use AWS EMR that you can follow. Here, I am assuming you already have Apache Spark on your system.

I also chose Python 3.8 for this project. We are going to use Hugging Face libraries and they are well aligned with Python. If you are using AWS EMR, you should already have it on your system. Otherwise, you need to install Python 3.8.

Now that you have both Apache Spark and Python 3 on your system, it is time to install the necessary libraries for this project. Run the following pip commands to install them. Here we are installing PySpark to run Spark jobs and the Transformers library from Hugging Face (which enables us to use hundreds of models including LLMs), plus PyTorch, which the model runs on.

pip3 install torch==1.13.1
pip3 install transformers==4.30.2
pip3 install pyspark==3.4.2
pip3 install urllib3==1.26.6
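After installing, it can help to confirm that the environment actually matches the pinned versions, especially on a cluster where some packages may be preinstalled. The following is a minimal sketch using only the standard library; the `PINNED` dictionary and the `check_versions` helper are names introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the pip commands above.
PINNED = {
    "torch": "1.13.1",
    "transformers": "4.30.2",
    "pyspark": "3.4.2",
    "urllib3": "1.26.6",
}


def check_versions(pinned=PINNED):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu117" when comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={installed} pinned={PINNED[name]} [{status}]")
```

On a cluster, the same check would need to pass on the worker nodes as well, since that is where the UDF runs.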

Coding Time

Start by creating a new Python file. I named it spark_llm_test.py. First, we need to import the libraries and create a new Spark session.

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType, LongType
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Create a Spark session
spark = SparkSession.builder.appName("FlanT5Seq2SeqExample").getOrCreate()

For this test, I created a dummy table with only two columns. The first column is an index and the second column is a random sentence. My goal for this project is to do sentiment analysis on the second column. Obviously, in a real-life use case, you would probably read your Spark DataFrame from a table (e.g. a Hive table) with much more complex data.

# Create an Example Spark DataFrame
schema = StructType([
  StructField("id", LongType(), nullable=False),
  StructField("sentence", StringType(), nullable=False)
])

data = [
  Row(1, "It is a good test for Spark."),
  Row(2, "Spark DataFrames are powerful."),
  Row(3, "LLMs could be very slow."),
  Row(4, "It is a naive statement.")
]

input_df = spark.createDataFrame(data, schema=schema)
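For the real-life case mentioned above, the input would come from an existing table rather than hand-written rows. Below is a minimal sketch of that read, assuming an active Spark session and a table that already has `id` and `sentence` columns. The `load_sentences` helper and the table name `reviews_db.customer_reviews` are hypothetical, and depending on the setup the session may need Hive support enabled through `SparkSession.builder.enableHiveSupport()`.

```python
def load_sentences(spark, table_name, id_col="id", text_col="sentence"):
    """Read the id and text columns from an existing table, skipping empty text.

    `spark` is an active SparkSession; `table_name` is a hypothetical
    table such as "reviews_db.customer_reviews".
    """
    return (
        spark.table(table_name)              # look the table up in the catalog
        .select(id_col, text_col)            # keep only the columns the UDF needs
        .where(f"{text_col} IS NOT NULL")    # the UDF expects a string, not None
    )


# Usage (assumes the table exists in your metastore):
# input_df = load_sentences(spark, "reviews_db.customer_reviews")
```

Filtering out null text up front keeps the UDF simple, since it formats the sentence straight into the prompt.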

For this purpose, I use the "Flan T5" model which is fine-tuned for various tasks, including sentiment analysis. As you see here, the model and tokenizer setup is very easy using the Hugging Face Transformers library.

# Loading Flan T5 Model and Tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

Now it is time to define our Spark UDF function. Spark will call this function on every row of our data to transform the data as we specify in this function.

One important point about this UDF: as you see, we did not instantiate the model and tokenizer inside it. The reason is that loading the model and tokenizer (even for such a small LLM) adds a large overhead. If they were loaded inside the UDF, then for every row the model and tokenizer would have to be loaded on the worker nodes before the transformation (which is an inference here) could happen. That would slow down our process significantly. To avoid it, we loaded the model and tokenizer outside of the UDF.

# Defining the Spark UDF
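def t5_seq2seq_udf(input_text):
  # Build the instruction prompt for the model
  prompt = f"sentiment of the text: {input_text}"
  # Tokenize the prompt into PyTorch tensors ("inputs" avoids shadowing the built-in input)
  inputs = tokenizer(prompt, return_tensors="pt")
  # Generate the output token ids
  output = model.generate(**inputs)
  # Decode the first generated sequence back into text
  output_text = tokenizer.decode(output[0], skip_special_tokens=True)
  return output_text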

Finally, we need to register the UDF function and create a new column to save the sentiment analysis results.

t5_udf = udf(t5_seq2seq_udf, returnType=StringType())

results_df = input_df.withColumn('output_column', t5_udf(input_df['sentence']))

results_df.show(truncate=False)
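The loading cost discussed above can also be handled in a different way: load the model lazily inside each Python worker process and cache it, so it is loaded once per process rather than once per row. The sketch below is an alternative to the code in this article, not a description of it. The names `get_model_and_tokenizer`, `t5_sentiment_cached`, and `build_cached_udf` are introduced here for illustration, and it assumes each worker can download or already has the `google/flan-t5-small` weights.

```python
from functools import lru_cache

MODEL_NAME = "google/flan-t5-small"


@lru_cache(maxsize=1)
def get_model_and_tokenizer():
    """Load the model and tokenizer once per Python process and reuse them."""
    # Imported lazily so the heavy import happens only where inference runs.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
    return model, tokenizer


def t5_sentiment_cached(input_text):
    """Same inference steps as the UDF above, using the cached model."""
    model, tokenizer = get_model_and_tokenizer()
    prompt = f"sentiment of the text: {input_text}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs)
    return tokenizer.decode(output[0], skip_special_tokens=True)


def build_cached_udf():
    """Wrap the cached function as a Spark UDF returning a string."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(t5_sentiment_cached, returnType=StringType())


# Usage (assumes input_df from the example above):
# results_df = input_df.withColumn("output_column", build_cached_udf()(input_df["sentence"]))
```

The trade-off is that the first row handled by each worker process pays the full loading cost, while the function Spark ships to the workers no longer carries the model with it.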

Below, you can find the entire code.

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType, LongType
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Create a Spark session
spark = SparkSession.builder.appName("FlanT5Seq2SeqExample").getOrCreate()

# Create an Example Spark DataFrame
schema = StructType([
  StructField("id", LongType(), nullable=False),
  StructField("sentence", StringType(), nullable=False)
])

data = [
  Row(1, "It is a good test for Spark."),
  Row(2, "Spark DataFrames are powerful."),
  Row(3, "LLMs could be very slow."),
  Row(4, "It is a naive statement.")
]

input_df = spark.createDataFrame(data, schema=schema)

# Loading Flan T5 Model and Tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

# Defining the Spark UDF
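def t5_seq2seq_udf(input_text):
  # Build the instruction prompt for the model
  prompt = f"sentiment of the text: {input_text}"
  # Tokenize the prompt into PyTorch tensors ("inputs" avoids shadowing the built-in input)
  inputs = tokenizer(prompt, return_tensors="pt")
  # Generate the output token ids
  output = model.generate(**inputs)
  # Decode the first generated sequence back into text
  output_text = tokenizer.decode(output[0], skip_special_tokens=True)
  return output_text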

t5_udf = udf(t5_seq2seq_udf, returnType=StringType())

results_df = input_df.withColumn('output_column', t5_udf(input_df['sentence']))

results_df.show(truncate=False)

If you run the code using the spark-submit command, you will get the following results. As you can see, the Flan T5 model did a good job of telling the positive sentences from the negative ones.

+---+------------------------------+-------------+
|id |sentence                      |output_column|
+---+------------------------------+-------------+
|1  |It is a good test for Spark.  |positive     |
|2  |Spark DataFrames are powerful.|positive     |
|3  |LLMs could be very slow.      |negative     |
|4  |It is a naive statement.      |negative     |
+---+------------------------------+-------------+
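Because the model returns free text rather than a fixed set of classes, a downstream table may benefit from a small normalization step before the label is stored. Here is a minimal sketch, assuming the labels of interest are `positive`, `negative`, and `neutral`. Only the first two appear in the output above; `neutral`, the `unknown` fallback, and the `normalize_label` helper are assumptions introduced for illustration.

```python
# Labels we are willing to store; "neutral" is an assumption, not seen in the output above.
ALLOWED_LABELS = ("positive", "negative", "neutral")


def normalize_label(raw, allowed=ALLOWED_LABELS, default="unknown"):
    """Map raw model output to a known label, or to `default` if unrecognized."""
    if raw is None:
        return default
    # Trim whitespace, lowercase, and drop a trailing period.
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in allowed else default


if __name__ == "__main__":
    for raw in ["positive", " Negative. ", "it depends"]:
        print(repr(raw), "->", normalize_label(raw))
```

This could be called on `output_text` just before the UDF returns it, so the new column only ever holds values from a known set.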

Future of Spark and LLMs

Congratulations! You have successfully executed Flan T5 as a Spark job. To keep things simple, we omitted several details, including how the master and worker nodes transfer model weights to run inference. Whether Spark can be employed efficiently for LLM inference is an extensive topic that deserves its own discussion.

In addition, here we used Spark for data transformation, which is an application of an existing model. If Spark can also be used efficiently in the model development process, that would be another big win for Apache Spark.

Finally, this was just a simple example of batch processing. Another application of Spark and LLMs could be stream processing, providing real-time analysis on data streams. This was a good start to using Spark and LLMs together, but the possible applications of and collaborations between these two tools are endless.

