Data Analysis Made Easy: Using LLMs to Automate Tedious Tasks

Data analysis can be equal parts challenging and rewarding. From cleaning messy datasets to building complex models, there's always a lot to do, and not enough time to do it. But what if there was a way to streamline and automate some of the more routine tasks, freeing up more time for strategic thinking and decision-making? That's where LLMs come in.
Large Language Models (LLMs) are AI language models that can assist with a wide range of natural language processing tasks, from generating text to answering questions. And as it turns out, they can also be a valuable tool for data analysts. In this article, we'll explore some of the ways you can use LLMs in your day-to-day work as a data analyst, and show you how AI can help you work smarter, not harder.
Let's jump straight into it.
Note: these systems are not (yet) an end-to-end analyst solution that will replace you. Stay tuned to the space though.
How LLMs can help
LLMs can act as AI-powered chatbots that can assist with streamlining and automating tasks related to data analysis. With their advanced capabilities, LLMs can help with a variety of tasks. I've classified them into three broad categories:
- Technical: This category covers the most widely seen applications, which generally involve coding: writing code and documentation, cleaning data, answering coding questions, running data analyses and visualising data.
- Soft: This category covers the soft skills that are often necessary to be a successful data analyst. AI can help with drafting documents to communicate findings, collecting data requirements from partners and summarising meeting notes.
- Strategic: Perhaps the most valuable thing data analysts can offer is their strategic thinking, which can also be enhanced with AI. This includes brainstorming what analyses to run, creating broad understanding frameworks, improving and iterating on your analytical approach, and acting as a general thought-partner.
Putting all of these into practice can save a significant amount of time and effort throughout your career as a data analyst.

Let's explore some examples of these to see just how powerful and flexible the tools are today.
Examples of using LLMs
This section contains examples of LLMs in action. The responses are mostly set in a different format to highlight them, otherwise you might confuse them with my writing!
Throughout this article as we prompt the LLM, we'll use an excerpt from the Spotify and Youtube songs dataset that includes the column information and first 20 rows of the dataset. In the future, an LLM could have direct access to the dataset, as well as responses to code it executes on the dataset, to remove the limitations of such a small sample.
A Technical Wizard
LLMs trained on codebases are competent coders, as seen in this article, which means they can readily solve common technical data analyst tasks. They perform very well on tasks that require little context, and technical tasks more commonly fall into this bucket.
One of the most basic tasks of a data analyst is performing an exploratory data analysis (EDA). LLMs are able to write Python code that can explore a dataset as well as output the corresponding images. The following quoted text contains responses returned by an LLM when prompted to do an EDA:
- Read in csv files and display examples:
"df = pd.read_csv("filename.csv")
df.head()"
- Identify columns of interest and explore:
"Group the data by Artist and check the count of songs by each artist.
df.groupby('Artist')['song name'].count()"
- Create plots:
"Create a histogram of the danceability column to see the distribution.
plt.hist(df['danceability'], bins=20)"
These are fairly basic, but LLMs can be pushed further to provide other EDA approaches. In this case, they're able to offer up outlier detection and clustering as further ways to explore the data, among others.
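For instance, one of those suggestions, outlier detection, could be sketched with a simple interquartile-range rule. The danceability values below are made up for illustration:

```python
import pandas as pd

# Made-up danceability values, with one obvious outlier
df = pd.DataFrame({"Danceability": [0.559, 0.62, 0.71, 0.75, 0.80, 0.938, 0.05]})

# Interquartile-range rule: flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = df["Danceability"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["Danceability"] < q1 - 1.5 * iqr) | (df["Danceability"] > q3 + 1.5 * iqr)
outliers = df[mask]
print(outliers)
```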
An example of the full response is below. Reading through it, it could definitely pass as work by a junior analyst, identifying insights such as the relative danceability of each artist's songs:
"The data is about songs from two artists, Gorillaz and Red Hot Chili Peppers. Gorillaz has 9 songs in the given data while Red Hot Chili Peppers has 4 songs.
The danceability and energy values for the songs vary widely between 0.559 to 0.938 and 0.345 to 0.938, respectively. The danceability values for the Gorillaz songs are generally higher than that of Red Hot Chili Peppers. The energy values for Gorillaz songs also tend to be higher than that of Red Hot Chili Peppers.
The keys for the songs range from 1.0 to 11.0, with no discernible pattern or trend."
When creating tables, it can be a little annoying to write good descriptions for columns and the overall table. While this is good practice, it is often overlooked. LLMs can do this by guessing the context of the table, or even better, can write them when some context is provided.
"This table appears to show some information about various songs by two different artists – Gorillaz and Red Hot Chili Peppers. Each row in the table represents a different song, and there are 10 columns of information provided for each song. Here are the descriptions of the columns:
Index: This is just an integer index, which appears to be a unique identifier for each song in the table.
Artist: This column lists the artist who created the song.
Link: This column provides a link to the artist's page on Spotify.
Song Name: This column lists the name of the song.
Album Name: This column lists the name of the album that the song appears on.
Album: This column indicates whether the row represents an album or a single.
Source Identifier: This column appears to be a unique identifier for each song, possibly specific to Spotify.
Danceability: This column provides a score from 0 to 1 indicating how suitable the song is for dancing.
Energy: This column provides a score from 0 to 1 indicating the energy level of the song.
Key: This column provides a number from 0 to 11 indicating the musical key of the song.
Overall, this table appears to be a collection of song data, including information about the artist, album, and various musical features of each song."
One of the most impressive parts of these results is that the LLM has looked through the data samples to work out that the "Key" column is not some key for the rows but the musical key of the song, and that the "Album" column describes whether the track comes from an album or a single. A typical analyst couldn't guess these just by looking at the column names and would have to spend time on basic EDA to work out what value the data source can provide – LLMs save this time.
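In practice, getting these descriptions is just a matter of pasting a small data sample into the prompt. A minimal sketch of assembling such a prompt from a DataFrame is below; the columns are a stand-in for the real excerpt, and the actual LLM call is omitted:

```python
import pandas as pd

# Stand-in sample of the table (columns are assumptions based on the excerpt)
df = pd.DataFrame({
    "Artist": ["Gorillaz", "Red Hot Chili Peppers"],
    "Album": ["album", "single"],
    "Key": [6.0, 11.0],
})

# Paste the column names and a few sample rows straight into the prompt
prompt = (
    "Write a short description of this table and of each column:\n\n"
    + df.head(20).to_csv(index=False)
)
print(prompt)
```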
The main use-case today for augmented intelligence in data analysis is technical question answering, via the likes of Stack Overflow. However, LLMs are often superior, providing correct code for your specific use-case and variables, as well as offering deeper answers or tutorials as a follow-up.
For example, plotting a Wordcloud is something that is required every now and then, but it's difficult to remember the exact library and code to use to plot one. You can simply ask an LLM, and it will happily return you working code. The key excerpt from that below:
# Generate the wordcloud
wordcloud = WordCloud(width = 800, height = 800, background_color ='white', stopwords = set(), min_font_size = 10).generate(wordcloud_text)
I realised that this uses just a list of words, but I had word counts, so I quickly followed up with "Can I feed in weighted words instead?" and the LLM re-did the code:
# Generate the wordcloud
wordcloud = WordCloud(width = 800, height = 800, background_color ='white', stopwords = set(), min_font_size = 10).generate_from_frequencies(weighted_words)
This wouldn't have been possible with Stack Overflow, and it saved plenty of time Googling or looking through old code.
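For completeness, here's a runnable sketch around those excerpts: build the `weighted_words` frequencies with `collections.Counter`, then feed them to `generate_from_frequencies`. The `wordcloud` package is a third-party dependency, so the import is guarded here, and the sample text is made up:

```python
from collections import Counter

# Build word weights from raw text (a made-up stand-in for real song-title data)
text = "dance dance energy feel good feel good inc"
weighted_words = dict(Counter(text.split()))
print(weighted_words)

# wordcloud is a third-party package, so guard the import
try:
    from wordcloud import WordCloud
    wc = WordCloud(width=800, height=800, background_color="white",
                   min_font_size=10).generate_from_frequencies(weighted_words)
    wc.to_file("wordcloud.png")
except ImportError:
    pass  # `pip install wordcloud` to actually render the image
```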
As can be seen, LLMs are very capable at technical questions and can save time in the data analysis workflow, but what else are they capable of?
A Soft Approach from AI
Soft-skills may be overlooked as a requirement for data analysis, but they're crucial to ensure any analysis is useful for the business.
Any data analyst knows that understanding what data the people you work with actually want is one of the toughest parts. This can go wrong in many ways, from the partner making a simple mistake in what something is called, to extending what they want as soon as you return the first part, to asking for the wrong data that doesn't support the conclusion they want to make. Due to their conversational nature, LLMs are great at collecting requirements from partners. An example conversation is below.
"I need you to pull the latest animated sticker numbers.
Great, thanks for your request! Just to clarify, when you say "animated sticker numbers", can you please provide more information on what exactly you mean by that?
Are you looking for data on the number of animated stickers that have been created? Or are you interested in the usage metrics of animated stickers, such as the number of times they've been viewed or downloaded?
Additionally, could you please specify the time frame you're interested in? Are you looking for data on the latest animated sticker numbers for a specific day, week, or month? Finally, would you like me to segment the data in any way, such as by sticker category or by user demographic?"
Practically, it may be difficult to ask your partners to go chat to an LLM first, but LLMs are useful for figuring out if there are things you should have clarified further before kicking off work on a project, and in the future will be integrated into analytics solutions directly.
Once the requirements are collected and the analysis is done, the findings must be communicated back. LLMs already have a track record of helping writers to produce content faster, and of a higher quality. In the same way, an LLM can help with preparing documents summarising analysis work, or prepping presentations. Some particular tasks that LLMs are useful at are basic introductions and conclusions, as well as text summarisation.
"Based on this EDA, the song recommendation service might consider adding more Gorillaz songs to their recommendation algorithm and prioritize songs with higher danceability and energy values, as these seem to be popular among their users."
Some tips for using LLMs for writing can be found in this post, but there are many more out there, and more will appear as we continue to learn how best to work with LLMs.
We've now seen how LLMs can solve technical tasks and assist with a lot of the ‘soft' tasks. This is already great for improving an analyst's workflow, but can we take this further?
The Grand Command
Strategic thinking is one of the highest-value things an analyst can offer a company, and it too can be enhanced with LLMs. Some of the ways this can be done include brainstorming what analyses to run, creating broad understanding frameworks, improving and iterating on your analytical approach, and using the LLM as a general thought-partner.
Let's start with brainstorming. You may have been given the generic task of "understanding more about your customers". What sort of analyses could you run? What outcomes would they lead to? Ideally you would brainstorm with your peers, but that's not always possible, so you can brainstorm with an AI instead.
When asked what analyses we should run to understand our customers better, the outputs included: "Customer segmentation, RFM analysis, Basket analysis, Churn analysis, Customer lifetime value." With some predicted outcomes:
"The outcomes of these analyses will depend on the specific insights you uncover. For example, you might discover that your most valuable customers are older, wealthy women who are interested in specialty cooking items. This could lead you to develop more products targeted towards this demographic and to invest in marketing campaigns that resonate with them."
That's a great starting point, and can make sure you're covering all the gaps, which can be difficult when working alone. LLMs can also give advice on frameworks. You might be wondering how you can grow your business over the next few years, and how you might measure that, well:
"There are several growth frameworks that can be used to understand how to grow sales over the next three years. Here are three popular frameworks:
AARRR: This framework stands for Acquisition, Activation, Retention, Referral, and Revenue. It is a five-step framework developed by Dave McClure that is commonly used in startup circles. The AARRR framework helps companies understand the different stages of the customer lifecycle and how to optimize each stage to drive growth.
RACE: This framework stands for Reach, Act, Convert, Engage. It is a four-stage framework developed by Smart Insights that is focused on digital marketing. The RACE framework helps companies understand how to optimize their digital marketing efforts to drive growth.
3HAG: This framework stands for Three-Year Highly Achievable Goal. It is a strategic planning framework developed by Shannon Susko that helps companies set and achieve ambitious growth goals. The 3HAG framework involves defining a BHAG (Big Hairy Audacious Goal) and then breaking it down into smaller, achievable goals over the course of three years.
When deciding which growth framework to use, it's important to consider your company's unique situation and goals. AARRR and RACE are more tactical frameworks that focus on optimizing specific stages of the customer lifecycle or digital marketing efforts, while 3HAG is a more strategic framework focused on setting ambitious growth goals."
As you can see, the response is helpful and could certainly stand in for a response you'd expect from an analyst in your team.
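As a taste of how quickly one of the brainstormed suggestions turns into code, here's a minimal RFM (Recency, Frequency, Monetary) sketch in pandas. All column names and values here are illustrative assumptions on made-up order data:

```python
import pandas as pd

# Made-up order data (customer_id / order_date / amount are assumed column names)
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_date": pd.to_datetime([
        "2023-01-05", "2023-03-01", "2023-02-10",
        "2023-02-20", "2023-03-15", "2023-01-01",
    ]),
    "amount": [20.0, 35.0, 10.0, 15.0, 25.0, 50.0],
})

# Score each customer as of the day after the last observed order
snapshot = orders["order_date"].max() + pd.Timedelta(days=1)
rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)
print(rfm)
```

From here, segments such as "high-value but lapsing" customers fall straight out of sorting and bucketing these three columns.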
Which LLM to use?
As of writing this, there's really only one LLM that can do all the above tasks well, and that's OpenAI's ChatGPT. It was the first to popularise LLMs and remains one of the strongest offerings on the market, and is free (with a subscription option).
With the pace of change we're seeing in AI, it's very possible this won't be true in a few months, so it's worth noting that there are plenty of other competitors. For example, Google is developing their product, Bard, which is expected to perform similarly to ChatGPT. There are also many open source alternatives to consider. While these are generally not of the same quality, they're expected to keep improving and close the gap with commercially operated models.
Tips for using LLMs effectively
To get the most out of LLMs as a data analyst, there are a few tips you can follow. First, it's important to give clear and specific inputs to LLMs. This means using proper language, avoiding ambiguity, and providing context where necessary. Additionally, LLMs can work with both structured and unstructured data, so it's worth experimenting with different input formats to see which works best for a given task. Finally, it's important to remember that LLMs are a tool, not a replacement for human analysis. While they can help automate some routine tasks, it's still up to the data analyst to interpret the results and make informed decisions based on the data.
There are plenty of articles out there such as this one discussing how to work with LLMs and it's a growing field of study, so keep learning!
Conclusion
In conclusion, LLMs are a great tool to improve the efficiency of your analytics work and even to grow and learn new things. LLMs can help with technical problems, develop soft skills and improve your strategic thinking. Working with AI is the future, so now is the best time to start learning how to integrate it into your workflow so you're not left behind.