Using Generative AI To Get Insights From Disorderly Data
This article shares best practices from how we have used generative AI at my firm to analyze our data and run operations more effectively. It took a while, but I was able to get approval from the Marketing, Legal, Security, and PR teams at Salesforce to publish this article. I hope it helps you turbo-charge your data analysis as well.
All diagrams and figures in this article are directionally accurate and convey the real concepts, but the underlying data has been anonymized.
Insights-in-a-box
- Data Filtering with LLMs: No need to clean data at the source; use an LLM to purify data mid-stream.
- Python Automation with GPT: Intermediate Python skills are typically needed for data extraction, modification, and visualization, but GPT can automate and speed up these tasks.
- Domain-Specific Ticket Filtering: When metadata is unreliable, filter tickets by the support engineers who worked on them.
- Reliable Data Extraction: Focus on extracting reliable fields like descriptions and timestamps, as these are less prone to errors.
- Data Anonymization with GPT: Use GPT with open-source anonymizer libraries to anonymize data before sending it to public APIs.
- Choosing Delimiters Carefully: Select output delimiters thoughtfully, ensuring they don't interfere with language model processing, and sanitize input data by removing the chosen delimiter.
- Fine-Tuning GPT prompts for Accuracy: Evaluate and fine-tune the prompt's performance on known ticket descriptions before full analysis.
- Contextual Data Limits: Be aware of GPT's upper processing limits for contextually unrelated data chunks; stay 10% below the identified limit to avoid data loss.
- Brainstorming KPIs with GPT: After extracting metadata, use GPT to brainstorm and create meaningful KPIs for visualization.
- Streamlined Data Visualization: Utilize GPT to write Python code to create graphs, keeping analysis streamlined and version-controlled within one environment instead of using a separate visualization tool.
Summary
Have you ever faced off against large volumes of unkempt, free-form data entered by human beings and tried to make sense of it? It is an extremely brain-numbing and time-consuming job, and unless you have dedicated time to pore over it, chances are you have ended up just sampling the data and walking away with surface insights, likely based on untrustworthy metadata. Typically not great mileage.
It is not hard to see how large language models, which specialize in making sense of chaotic data, can help here. This article walks through best practices gleaned from such an implementation, covering a range of concepts: the most efficient way to use GPT to clean the data, do the analysis, and create useful graphs; approaches to managing Personally Identifiable Information (PII); a production-hardened prompt design; working around GPT's 'prefrontal cortex' bottleneck; and more!
But before all that, I'll start with sharing how this experience completely changed my own strongly-held opinion around Data Quality:
I used to believe that in order to improve data quality, you must fix it at the source, i.e. the Systems of Engagement. For example, I used to believe that for a Sales CRM, we must ensure Sales and Marketing teams are entering quality data and metadata in the beginning. Similarly for customer support, we have to ensure the Customer Support Engineers are selecting all the right metadata (ticket cause code, customer impact, etc) associated with the ticket at the inception, duration and closure of the ticket.
After my recent experiences, these beliefs have been smashed to bits. You can absolutely have unruly data at the source, and with the right direction, Large Language Models (LLMs) can still make sense of it resulting in meaningful insights!
No need to clean data at source: Like a water filter, you just plug an LLM in the middle of the stream and purify it!

Longer term, having processes in place to populate accurate metadata at the source definitely helps, though keep in mind that such processes are time-consuming to coordinate and audit.
Operating principles
In order to conduct this analysis, I had two simple principles:
- Avoid disrupting my team's current delivery: While it would have been easier for me to request someone in my team to do the analysis, it would have disrupted the team's velocity on already ongoing projects. I had to figure out how to do all of the analysis myself, while doing my day job as a product development executive.
- Use Generative AI for everything: Large Language Models are great at data manipulation and, specifically for this use case, at extracting value out of messy data. They are also much better than I am at coding. It's just easier to tell someone to do things and inspect the result than to get in the zone and do the work yourself. This way, you can make a dent even with part-time effort.
Using GPT to write the analysis code too!
Bottom-line up front: Getting data extracted, modified and visualized requires intermediate-level Python coding, but now GPT can do all of that for you much faster, if not with higher quality. Use it!
In the following picture, I illustrate all the various steps (in green font) for which code needed to be written to transform the data and then call the GPT API to extract insights from the ticket details. The best part is that I didn't have to write this code from scratch. I used GPT to write it for me!

How I did the coding
While I am reasonably decent with Python, using GPT to write code makes me at least 3x better. I used a very rudimentary method in writing code through GPT: I didn't use it to execute any code. I just told GPT what the data looked like and asked it to write code for me. I asked GPT to liberally insert print statements to print out variables at different points in the code. Then I copied that code in a Jupyter Notebook on my laptop and executed it there. For example, my prompt would be something like:
Me: Here are all the files I will use in my analysis. I'll enumerate them and call them by their number in the prompt.
1. "All Interacted Tickets.xlsx"
2. "Copy of Ticket Dump – Received from Ops.xlsx"
_3. "verifying_accurate_list_of_ops_people_supporting_myspace.xlsx"
They are all in the ../data/ directory.
Write python code to pull in files 1, 2, and 3 into pandas dataframes. Ignore all worksheets in any file which have the word pivot in them but pull in the data for the rest. Name the dataframes in snake case using the text in each of the worksheet itself in each of the excel files….
Then, GPT would spit out code, such as:
import pandas as pd
import os

# Define the file paths, keyed by the numbers used in the prompt
files = {
    1: "All Interacted Tickets.xlsx",
    2: "Copy of Ticket Dump - Received from Ops.xlsx",
    3: "Verifying_accurate_list_of_ops_people_supporting_my_space.xlsx",
}

def snake_case(s):
    return s.lower().replace(" ", "_")

def read_excel_sheets(file_path, ignore_pivot=True):
    xl = pd.ExcelFile(file_path)
    sheet_names = xl.sheet_names
    ....
I would then take this code and run it locally. If there was an error, I would paste the output (including the print statement output) into the same chat, since it preserves memory, and GPT was 'mostly' able to fix my issues. In some cases, though, GPT would get stuck (which you can recognize when it keeps recommending the same solution to an issue), and then I would start to interrogate it with more questions:
Me: what is df = df[1:] doing
GPT: The line df = df[1:] is used to drop the first row of the dataframe, which is typically the header row when the Excel file is read….
Me: df = df[1:] is incorrect, I don't want you to drop the first row. this is in fact the row you want to use for the header of each dataframe.
Therefore, if you develop code with GPT out-of-band like I did, a moderate knowledge of Python is useful to break through code issues, since GPT is pretty much blind to your execution context.
Note that if you use multi-agent frameworks, there is a chance that the agents would bounce the code off of each other and resolve these defects automatically. In a future post, I will be showing my local environment setup for data engineering and analytics which shows how to set up this multi-agent framework on your laptop. Please let me know in the comments if this would be of interest.
Step by Step approach on operational ticket analysis
I came up with the following steps after several iterations and 'missteps'! In other words, if I had to redo this analysis all over again, I would follow this structure to streamline the process. I present it to you so you can reap the benefits. You're welcome!
Step 1: Filter out relevant tickets
Bottom-line up front: If metadata is unreliable, then filtering tickets related to your domain based on the support engineers who worked them is your best option.

(You only need this step if you work in a medium to large organization and are one of many teams which leverage a shared Operations team)
Reducing the working set of tickets to what pertains to just your department or team is an important filtering step when a significant number of operational tickets are being worked across your firm. You will be sending these tickets through LLMs, and if you're using a paid service like GPT-4, you want to send only what is relevant to you!
However, deducing the working set of tickets is a problem when you have poor metadata. The support engineers may not have been instructed to mark which team a ticket belonged to, or may not have had good ticket categories to select from, so all you have to work with is some free-form data and some basic "facts" that were automatically collected for these tickets. These facts range from who created the ticket and who owned it, to the timestamps associated with ticket creation, state changes (if you're lucky), and ticket closure. Other "subjective" data likely exists as well, such as ticket priority. It's fine to collect it, but it can be inaccurate, since ticket creators tend to mark everything they open as "urgent" or "high priority". In my experience, deriving the actual priority through LLMs is often more neutral, though it can still be error-prone, as covered later.
So, in other words, stick to the "facts".
Amongst the "facts" that typically help you reduce the working set are the names of the support engineers that created and/or worked the ticket. Since support engineers also specialize in specific domains (data technologies vs CRM vs workday etc) the first step to take is to work with the support managers and identify the names of all the support engineers who work on the tickets associated in your domain.
Then, using an identifiable key such as their work email address, you can filter the morass of tickets down to the subset germane to your department and pull down the "fact" metadata associated with those tickets.
Completing this step also gives you your first statistic: how many tickets are getting opened for my space over a period of time.
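To make this concrete, here is a minimal sketch of that filtering step in pandas; the column names and engineer emails are illustrative assumptions, not the actual schema of my firm's ticketing system:

import pandas as pd

# Emails of the support engineers who cover our domain (illustrative values)
domain_engineers = {"engineer_one@example.com", "engineer_two@example.com"}

# Load the ticket dump and keep only tickets worked by those engineers
all_tickets = pd.read_excel("../data/All Interacted Tickets.xlsx")
our_tickets = all_tickets[all_tickets["owner_email"].isin(domain_engineers)].copy()

# First statistic: tickets opened for our space per month
our_tickets["created_date"] = pd.to_datetime(our_tickets["created_date"])
print(our_tickets.groupby(our_tickets["created_date"].dt.to_period("M")).size())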
Step 2: Extracting the "description" field and other metadata
Bottom-line up front: While a ticket creator can get much metadata wrong, she can't afford to mess up the description field, because that is the one way she can communicate her issue and its business impact to the support team. This is perfect, as making sense of free-flow data is GPT's specialty. Therefore, focus on extracting the description field and other factual, hard-to-mess-up data like ticket start and end times.

Most ticketing systems like Jira Service Management, Zendesk, ServiceNow etc. allow you to download ticket metadata, including the long, multi-line description field. (I wasn't as lucky with the homegrown system we use at my work.) However, almost all of them cap the number of tickets that can be downloaded at one time. A more automated way, and the route I took, is to extract this data using an API. In this case, you need the curated set of tickets worked by the support engineers supporting your teams from Step 1, and then you loop over each ticket, calling the API to pull down its metadata.
Some other systems allow you to issue SQL (or SOQL in the case of Salesforce products) queries through an ODBC-like interface, which is cool because you can combine Step 1 and Step 2 in one go using the WHERE clause. Here's some example pseudo-code:
SELECT ticket_number, ticket_description, ticket_creation_date, blah blah
FROM ticket_table
WHERE ticket_owners include "[email protected], [email protected]" ...
You get the idea…
Save this data in MS-Excel format and store it on disk.
Why MS-Excel? I like to "serialize" tabular data into MS-Excel format because it removes any issues with escaping or recurring delimiters when pulling the data into Python code. The Excel format encodes each data point into its own "cell", so there are no parsing errors and no column misalignment due to special characters or delimiters buried inside text. Further, when pulling this data into Python, I can use pandas (a popular tabular data manipulation library) to load the Excel data into a dataframe using its simple Excel import option.
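For reference, a minimal sketch of that Excel round trip with pandas, assuming a toy dataframe and that an Excel engine such as openpyxl is installed:

import pandas as pd

# Toy dataframe standing in for the curated ticket metadata from Steps 1 and 2
tickets_df = pd.DataFrame({
    "ticket_number": ["16220417"],
    "description": ["The FACT_XYZ_TABLE has not refreshed in time..."],
})

# Serialize to Excel: each value lands in its own cell, so no delimiter/escaping issues
tickets_df.to_excel("../data/curated_ticket_metadata.xlsx", index=False)

# Later, pull the data back into Python with the simple Excel import option
tickets_df = pd.read_excel("../data/curated_ticket_metadata.xlsx")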
Step 3: Converting data into a GPT-friendly format (JSON)
Bottom-line up front: JSON is human readable, machine readable, error-safe, easy to troubleshoot, and manipulated by GPT with the fewest errors. Further, as you enrich your data you can keep hydrating the same JSON structure with new fields. It's beautiful!
"16220417": {
"description": "Hi Team, nThe FACT_XYZ_TABLE has not refreshed in time. Typically the data is available at 10am PST, but when I see the job, it has been completing several hours late consistently for the last few weeks. Also, it is 11am, and I see that it is still in running state!nn It is critical that this table is completed in time, so we have enough time to prepare an important sales executive report by 8am every morning. Please treat this with urgency.",
"opened_priority": "P2 - Urgent",
"origin": "Helpdesk ticket portal",
"closedDate": "2024-02-26T07:04:54.000Z",
"createdDate": "2024-02-01T09:03:43.000Z",
"ownerName": "Tom Cruise (Support Engineer)",
"ownerEmail": "[email protected]",
"contactName": "Shoddy Joe (Stakeholder)",
"contactEmail": "[email protected]",
"createdByName": "Shoddy Joe (Stakeholder)",
"createdByEmail": "[email protected]",
"ticket_status": "closed"
},
The above snippet shows a sample JSON-ified ticket metadata with ticket number as key, pointing to an object containing further key/value metadata. There would be lots of these types of JSON blocks in the file, one for each ticket.
After some trial-and-error iterations, I realized the most efficient way for GPT to write data processing code for me was to convert my data into a JSON format and share that format with GPT to operate on. There is nothing wrong with shoving this data into a pandas dataframe, and it may even be easier to do that first in order to efficiently process, clean and transform the data. The big reason why I eventually landed on converting the final data set into JSON is that sending tabular data into a GPT prompt is kludgy. It is hard for humans to read, and it also introduces errors for the LLM, as explained below.
When you introduce tables into a prompt, it has to be done in a comma-separated-value (CSV) format. There are two problems with that:
- Since there can be commas inside the text as well, you have to escape those commas by putting the text inside double quotes (for example, "text one", "text, two"). That introduces another problem:
- What if you have double quotes (") inside that text block (for example, "test "hi!"")? Now you have to further escape those double quotes, and correctly separating these values into their columns invariably brings issues.
And yes, while you have to escape double quotes within JSON too (e.g. "key": "value has \"quotes\""), there are absolutely no issues aligning that value to a column, since the "key" uniquely identifies it. In a CSV, the column alignment can go off in some edge cases, and then it becomes very hard to troubleshoot what went wrong.
Another reason for using JSON is that you can cleanly see and differentiate when you augment your metadata through GPT in future steps; it just adds more key/value pairs to each ticket's object. You could do that in a table too, but that mostly requires scrolling to the right in your IDE or notebook.
Pro-tip: In a future step, you will be sending this data to GPT and asking it to return multiple fields separated by a delimiter, such as "|". Therefore, this is a good time to remove any occurrence of that delimiter from the free-form fields you are placing into the JSON. You don't want to risk GPT echoing a "|" from the input back out in a field value.
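Here is a minimal sketch of that conversion, assuming the curated ticket dataframe from Step 2 with an illustrative ticket_number column:

import json
import pandas as pd

tickets_df = pd.read_excel("../data/curated_ticket_metadata.xlsx")

# Key each ticket's metadata by its ticket number, as in the sample above
tickets_json = {
    str(row["ticket_number"]): row.drop(labels=["ticket_number"]).to_dict()
    for _, row in tickets_df.iterrows()
}

with open("../data/tickets.json", "w") as f:
    json.dump(tickets_json, f, indent=2, default=str)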
Step 4: Enhancing data using simple techniques (aka Basic Feature Engineering)
Bottom-line up front: Simple mathematical analysis like time deltas, averages, and standard deviations can easily, and more cheaply, be done with basic code, so get GPT to write that code and run it locally instead of sending GPT the data to do the math for you. Language models have been shown to make mathematical mistakes, so it's best to use them for what they're good at.
First, we can enhance the ticket metadata by aggregating some of the basic information in it. This is a pre-step which is better done with some simple code instead of burning GPT credits for it.
In this case, we calculate the ticket duration by subtracting the ticket's created timestamp from its closed timestamp.
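A minimal sketch of that calculation against the JSON structure from Step 3 (the field names follow the sample shown earlier):

import json
from datetime import datetime

with open("../data/tickets.json") as f:
    tickets = json.load(f)

for ticket in tickets.values():
    created = datetime.fromisoformat(ticket["createdDate"].replace("Z", "+00:00"))
    closed = datetime.fromisoformat(ticket["closedDate"].replace("Z", "+00:00"))
    # Hydrate the same JSON structure with a new derived field
    ticket["duration_days"] = round((closed - created).total_seconds() / 86400, 1)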

Step 5: The Main Entree: GPT-driven data enhancement (enhanced Feature Engineering)
Now we come to the main entree: how to use GPT to transform raw data and derive sophisticated, structured metadata from which insights can be extracted. In the world of data science, this step is called Feature Engineering.
5.1: Pre-processing: Obfuscate sensitive information (optional)
Bottom-line up front: Get GPT to use open source anonymizer libraries and develop code to anonymize the data before you send it to a public API service.

This step applies to you in case you are using openAI and not a local open source LLM where the data stays on your laptop. In a future post, I will be showing my local environment setup for data engineering and analytics which shows an open-source LLM option.
At the firm I work in, we have a safe proxy gateway to both openAI and internally trained LLMs; it can mask Personally Identifiable Information (PII) and operates the OpenAI models within a trusted boundary. This is convenient because I can send all internal information to this proxy and enjoy the benefits of OpenAI's cutting-edge models in a safe way.
However, I realize not all companies are going to have this luxury. Therefore, I'm adding an optional step here to obfuscate personally identifiable information (PII) or other sensitive data. The beautiful part of all this is that GPT knows about these libraries and can be used to write the code which obfuscates the data too!
I evaluated five libraries for this purpose. The critical feature I was looking for was the ability to convert sensitive information to anonymized data and then re-convert it back as well. I found only the following libraries with this capability.
- Microsoft Presidio [link] (uses the concept of entity mappings)
- Gretel synthetics [link] (uses the concept of a "Tokenizer")
Out of these two, Presidio was my favorite. I continue to be impressed by the amount of high-quality open source contributions Microsoft has made over the last decade, and this set of Python libraries is no different. It can identify PII-type data out of the box, and it lets you customize and specify other data that needs to be anonymized.
Here's an example:
original text:
('Peter gave his book to Heidi which later gave it to Nicole.
Peter lives in London and Nicole lives in Tashkent.')
Anonymized text:
('<PERSON_0> gave his book to <PERSON_1> which later gave it to <PERSON_2>.
<PERSON_0> lives in <LOCATION_0> and <PERSON_2> lives in <LOCATION_1>.')
This can be sent to GPT for analysis. When it returns the results, you run that through the mapping to de-anonymize it:
Entity mappings:
{ 'LOCATION': {'London': '<LOCATION_0>', 'Tashkent': '<LOCATION_1>'},
  'PERSON': { 'Heidi': '<PERSON_1>',
              'Nicole': '<PERSON_2>',
              'Peter': '<PERSON_0>'}
}
Using the entity mappings, the text can be de-anonymized:
de-anonymized text:
('Peter gave his book to Heidi which later gave it to Nicole.
Peter lives in London and Nicole lives in Tashkent.')
I recommend checking out this notebook, which walks you through how to implement this approach.
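To give a flavor of the API, here is a minimal analyze/anonymize sketch using presidio-analyzer and presidio-anonymizer; the reversible, entity-mapping-based de-anonymization shown above is the part covered in the linked notebook:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Peter gave his book to Heidi which later gave it to Nicole."

# Detect PII entities (PERSON, LOCATION, etc.) in the raw text
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

# Replace the detected spans with placeholders before sending the text to a public API
anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)  # e.g. "<PERSON> gave his book to <PERSON> ..."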
Note that apart from PII, other information that may need to be obfuscated includes systems information (IP addresses, DNS names, etc.) and database details (names, schemas, etc.).
Now that we have a mechanism to anonymize sensitive data, the next step was to create a high quality prompt to run on this data.
5.2 Pre-processing: Sanitize the input data
Bottom-line up front: Be thoughtful in choosing an output delimiter, as certain special characters hold "meaning" in language models. Then, you can feel secure in sanitizing the raw input by removing the delimiter you chose.
Problem: When asking a text-based interface like an LLM to return tabular data, you have to tell it to output the data separated by delimiters (e.g. CSV or TSV format). Suppose you ask GPT to output the summarized data (aka "features") in comma-separated values. The challenge is that the input ticket data is raw and unpredictable, and someone could have used commas in their description. This technically should not be a problem, since GPT transforms the data and should drop the incoming commas, but there was still a risk that GPT could reuse part of the raw data (which included commas) in its output, say in the one-liner summary. The experienced data engineering folks have probably caught on to the problem by now: when your data values themselves contain the delimiter that is supposed to separate them, you can have all sorts of processing issues.
Some may ask: Why not escape all of this by encapsulating the value in double quotes? E.g.
"key" : "this, is the value, with all these characters !#@$| escaped"
Here's the issue with that: the user could have input double quotes in their data too!
"key" : "this is a "value" with double quotes, and it is a problem!"
Yes, there are ways of solving this issue too, like using multi-line regular expressions, but they make your code complicated and make it harder for GPT to fix defects. So the easiest way to handle this was to choose an output delimiter that would lose the least data context if scrubbed from the input, and then scrub it out of the input data!
I also played around with delimiters that would almost certainly not appear in the input data, like |%|, but I quickly realized that these ate up the output token limit fast, so that was out.
Here are a few delimiters I tested

In the end, I ended up selecting the pipe "|" delimiter as this is not something most stakeholders used when expressing their issues in the ticket description.
After this, I got GPT to write some extra code to sanitize each ticket's description by removing "|" from the text.
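That sanitization boils down to a one-line scrub over each description; here is a minimal sketch against the JSON structure from Step 3:

import json

OUTPUT_DELIMITER = "|"

with open("../data/tickets.json") as f:
    tickets = json.load(f)

# Remove the chosen output delimiter from the free-form field before prompting
for ticket in tickets.values():
    ticket["description"] = ticket["description"].replace(OUTPUT_DELIMITER, " ")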
5.3 – Prompt performance tuning
Bottom-line up front: Before running the GPT data analysis prompt, evaluate its performance against a set of ticket descriptions with known output, fine tune the prompt and iterate until you are getting the maximum performance scores.

Goal: To have GPT read the ticket description written by the customer and just from that, derive the following metadata which can then be aggregated and visualized later:
- Descriptive title summarizing the issue
- Business Impact*
- Ticket Severity*
- Ticket Complexity
- Impacted stakeholder group
- Owning team
- Ticket category
* based on impact and urgency if provided by customer
Approach: The way I worked on sharpening the main prompt was to
- sample a few control tickets,
- manually classify each of them the same way I wanted GPT to do them (by Category, Complexity, Stakeholder (Customer) group etc),
- run these control tickets through the designed GPT prompt,
- cross-compare GPT results against my own manual classification,
- score the performance of GPT's classification against each dimension, and
- improve the GPT prompt based on whichever dimensions scored lower.
This gave me important feedback which helped me sharpen my GPT prompt to get better and better scores against each dimension. For the final prompt, check out Appendix: The GPT prompt to process ticket descriptions.
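Here is a minimal sketch of that scoring loop, assuming the manual labels and GPT's answers are held in two dictionaries keyed by ticket number (the ticket values shown are illustrative):

# Illustrative structures: {ticket_number: {dimension: value}} for both sources
manual_labels = {
    "16220417": {"category": "Data Refresh", "complexity": "Medium", "severity": "P2"},
}
gpt_labels = {
    "16220417": {"category": "Data Refresh", "complexity": "High", "severity": "P2"},
}

for dim in ["category", "complexity", "severity"]:
    matches = sum(
        1 for t in manual_labels if gpt_labels.get(t, {}).get(dim) == manual_labels[t][dim]
    )
    print(f"{dim}: {matches / len(manual_labels):.0%} agreement with manual classification")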
Results:
Here are the details around this metadata derived from raw ticket description, and the overall performance scores after multiple iterations of fine-tuning the prompt:

Here's my rationale on why certain dimensions scored low despite multiple turns:
- Complexity: I ran into a challenge when scoring "Complexity" for each ticket: GPT scored the complexity of a ticket much higher than it deserved based on its description. When I told it to score more aggressively, the pendulum swung the other direction and, like a dog trying to please its owner, it started to score complexity much lower, so it was unreliable. I suspect the out-of-the-box behavior of scoring complexity too high reflects the current state of the art of GPT capabilities. I used GPT-4, which is often compared to a smart high school student, so naturally a high school student would score this complexity higher. I suspect that future versions of these frontier models will bring college-level and then PhD-level abilities, and we will be able to measure the complexity of such tasks more accurately. Alternatively, to improve even GPT-4's complexity scoring, I could have used the "few-shot" learning technique and given it some examples of complexity, which may have improved the performance score for this dimension.
- Severity: While I asked GPT to use the impact vs. urgency matrix to score severity, GPT had to rely on whatever the stakeholder had provided in the ticket description, which could be misleading. We are all guilty of using words designed to provoke faster action when we open internal tickets with IT. Further, in a non-trivial number of cases the stakeholder didn't even provide any impact detail in the ticket description, which led GPT to select an erroneous severity as well.
Despite some metadata dimensions scoring low, I was pleased with the overall output. GPT was scoring high on some critical metadata like title and category, and I could run with that.
The prompt was in good shape, but I was about to run into an interesting GPT limitation, its "forgetfulness".
5.4 – Figuring out the limits of GPT's forgetfulness
Bottom-line up front: When sending contextually unrelated chunks of data (such as many ticket descriptions) into a GPT prompt, the upper processing limit can be much lower than what you get by stuffing in the maximum chunks allowed by the input token limit (in my case this upper limit ranged between 20 and 30). GPT consistently forgot or ignored anything beyond this limit. Identify the limit through trial and error, then stick to a number about 10% below it to avoid data loss.

Humans can keep 5–7 unrelated things in our prefrontal cortex; it turns out GPT could only reliably keep about 20–30 unrelated things in mind here, no matter how big its context window. And I was only really sending the ticket number and description; the rest of the data did not require any fancy inference.
Since I had almost 3,000 tickets for GPT to review, my original inclination was to maximize each round trip and "pack" as many case descriptions as I could into each prompt. I came up with an elaborate methodology to identify the average token size based on the number of words (a token is a sub-word in the transformer architecture), and saw that I could fit around 130 case descriptions into each prompt.
But then I started seeing a weird phenomenon: no matter how many ticket descriptions I sent to GPT to process, it consistently processed only the first 20 to 30 tickets! GPT appeared not to have the capacity to handle more than this magic number.

This made me change my strategy: I decreased the ticket batch size to a maximum of 10–12 tickets per API call, based on the word count of each chunk, comfortably below the 20–30 upper limit. While this approach increased the number of calls, and therefore prolonged the analysis, it ensured that no tickets got dropped from processing.
*Total tickets chunked: 3012*
*The full ticket data has been chunked into 309 chunks:*
*Chunk 0: Number of words = 674*
*Chunk 0: Number of tickets = 12*
*Chunk 1: Number of words = 661*
*Chunk 1: Number of tickets = 12*
*Chunk 2: Number of words = 652*
*Chunk 2: Number of tickets = 12*
*Chunk 3: Number of words = 638*
*Chunk 3: Number of tickets = 7*
*Chunk 4: Number of words = 654*
*....*
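A minimal sketch of the chunking logic that produces output like the above, assuming the sanitized JSON from earlier; the 700-word budget per chunk is illustrative:

import json

MAX_WORDS_PER_CHUNK = 700   # keeps each batch at roughly 10-12 tickets
MAX_TICKETS_PER_CHUNK = 12  # hard cap, well below the observed 20-30 limit

with open("../data/tickets.json") as f:
    tickets = json.load(f)

chunks, current, word_count = [], [], 0
for ticket_number, meta in tickets.items():
    words = len(meta["description"].split())
    if current and (word_count + words > MAX_WORDS_PER_CHUNK or len(current) == MAX_TICKETS_PER_CHUNK):
        chunks.append(current)
        current, word_count = [], 0
    current.append((ticket_number, meta["description"]))
    word_count += words
if current:
    chunks.append(current)

print(f"Total tickets chunked: {len(tickets)}")
print(f"The full ticket data has been chunked into {len(chunks)} chunks")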
When I reviewed this with an AI architect at my firm, he mentioned that this is a recently observed phenomenon in GPT: large input contexts only work well when you are feeding in contextually related data. It breaks down when you feed disparate chunks of information into GPT and ask it to process completely unrelated pieces of data in one go, which is exactly what I observed.
With an optimal ticket batch size of 10–12 tickets identified and a performant prompt created, it was time to run all the batches through the prompt.
5.5 Show time! Running all the tickets through GPT
Bottom-line up front: GPT can analyze in hours a volume of tickets that would take humans weeks or months, and it is extraordinarily cheaper, though there is an error rate associated with GPT.
I provided GPT with the JSON format to write me code which did the following:
- Load the JSON data into a dictionary
- Iterate 10–12 tickets at a time, concatenating the GPT analysis prompt with these tickets into the FULL GPT prompt, separating each ticket/description tuple by ###
- Sending the full prompt to the GPT API (For work, I called a safer internal wrapper of this same API that my firm has built, which has security and privacy embedded into it, but by using the obfuscator step earlier, you can just as safely use the external GPT API.)
- Save the output, which came back in pipe-separated format, by appending it to a file on disk.
- Running the de-anonymizer, if obfuscation was done earlier. (I didn't need to write this step due to the internal GPT wrapper API my firm has built)
- Convert the output into the original JSON file as well.
- Save the JSON file on disk after the full run is completed*.
- Print some visible cues on how many tickets had been processed.
- Log some stats for each API call: amount of text processed, number of tickets, start and end time.
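Here is a condensed, illustrative sketch of that loop; call_gpt() is a hypothetical stand-in for whichever GPT API (or internal proxy) you use, and the chunks variable comes from the batching step above:

OUTPUT_FILE = "../data/gpt_ticket_analysis.psv"
analysis_prompt = "..."  # the ticket-analysis prompt from the Appendix

# Hypothetical wrapper around the OpenAI API (or an internal safe proxy);
# it returns one pipe-separated row per ticket
def call_gpt(prompt: str) -> str:
    return "16220417|FACT_XYZ_TABLE refresh delayed|P2 - Urgent|..."

# chunks comes from the batching step; a tiny stand-in is shown here
chunks = [[("16220417", "The FACT_XYZ_TABLE has not refreshed in time...")]]

with open(OUTPUT_FILE, "a") as out:
    for i, chunk in enumerate(chunks):
        # Concatenate the analysis prompt with this batch, separating each ticket/description tuple by ###
        ticket_block = "\n###\n".join(f"{num}: {desc}" for num, desc in chunk)
        result = call_gpt(f"{analysis_prompt}\n###\n{ticket_block}")
        out.write(result + "\n")
        print(f"Processed chunk {i}: {len(chunk)} tickets")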
Why saving to disk after a good run is pragmatic: These are costly runs, more from a time perspective than a money perspective. So after a successful run is completed, it is wise to serialize (save) this data to disk, so that future analysis can run on the saved data and this code block in the Jupyter notebook doesn't have to be repeated. In fact, after a successful run, I commented out the whole code block in my notebook, so that if I ran the full notebook start to finish, it would skip this expensive step and instead load the JSON data from disk into memory and continue on with the analysis.
Here's a sample output of the fully hydrated JSON. The blue entries were metadata that GPT extracted from the description field

I ran about 3000 tickets through this cycle, and it completed in about 2.95 hours.