Topic Modelling Your Personal Data

In a prior article, I described how you can access the personal data that is stored and used by the front-line, consumer-facing companies we engage with every day. These include retailers, social media platforms, cell providers, financial services firms, and many others. I explored how to use various machine learning models and visualizations to discover how those companies perceive you.
In the process of working on that article, I discovered that the front-line firms frequently share our personal data with another set of companies generally known as data brokers or data aggregators (hereafter referred to as aggregators). Aggregators enhance our data with additional data sourced from public records, other aggregators, and similar sources to create profiles of us. They then sell those profiles back to consumer-facing companies and other organizations for marketing and other purposes.
My curiosity was aroused: Just what types of data do these aggregators keep about me? How many features do they store? Are there major types of data that individual aggregators focus on? And if there are, what does that tell me about their end-customers? What industries are those end-customers in, and what personal data do they find most valuable? I decided to find out.
I submitted personal data requests to three companies in the aggregators/broker business: Acxiom, Epsilon and Oracle. Here is a summary of the number of data features that each of them sent back to me (note that all images are by the author unless otherwise indicated):

Some of the data features from the aggregators were duplicates, appearing in two or all three of the aggregators' data sets. Most were not, however.
As you can see, I received an amazing number of data features about me from just these three vendors. It was more data than I wanted to analyze manually, so I decided to try topic modelling tools to extract the major subject areas focused on by each vendor.
Please note that I do not discuss the potential security risks, privacy risks, or ethical considerations of companies having this amount and type of personal information about me (and probably about you too!). Nor do I cover ways to delete your data and stop its collection. Finally, I don't venture an opinion on the value I get in return, if any, from the companies' use of my data. Those may be topics for future articles. Check back in six months!
My goal here is simply to explore techniques for extracting and modelling the data that the aggregators sent me. From that, I hope to learn something about the aggregators' business.
A Note on Data Pre-processing and Cleaning
All three vendors sent my data in the form of PDF files. Extracting data from PDFs can be challenging, and each extraction tool has its strengths and weaknesses. For this analysis, I needed a tool that could extract data from tables in the PDF documents, some of which spanned numerous pages. For this purpose, Camelot worked best for me. You can view the code I used to extract the table data from all three vendors' PDFs in the camelot_pdf_reads Jupyter notebook at the GitHub repo for this article.
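For reference, the basic Camelot call looks something like the sketch below. The file name and parameters here are illustrative only; the real extraction logic, including the handling of tables that span pages, lives in the camelot_pdf_reads notebook in the repo.
# illustrative sketch only - see the camelot_pdf_reads notebook for the actual code
import camelot
import pandas as pd
# read every table Camelot can find across all pages of a (hypothetical) vendor report
tables = camelot.read_pdf("acxiom_report.pdf", pages="all", flavor="lattice")
# stitch the per-page tables into a single dataframe for cleaning
df = pd.concat([t.df for t in tables], ignore_index=True)
print(f"{len(tables)} tables extracted, {len(df)} rows total")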
Here is a sample of the extracted table data in Pandas dataframe format:

As you can likely guess from looking at the above sample, each vendor's data posed unique cleansing and preparation challenges that had to be addressed to make it usable for the topic modelling steps. Acxiom, for example, keeps multiple instances of the same data attribute:

For my topic analysis, I was interested only in the unique attributes stored by each vendor, not in the different values that a vendor might assign to a given attribute. Accordingly, I dropped all of Acxiom's duplicate observations, keeping only the observation with the highest instance number (029 for the Name element in my data).
As you may notice above, Acxiom lists its data features with hierarchy-like labels (01. Personal Identifiers → 01. Name → Name…). The prefixes (01. Personal Identifiers →…) appear many times, even after duplicate attribute entries are removed. I felt that leaving the repetitious prefixes in the data could unduly influence my topic modelling, weighting these phrases too heavily. So, I removed those prefixes as appropriate before creating models.
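As a rough sketch of those two steps (the column names below are placeholders, not the actual repo code):
# rough sketch of the Acxiom clean-up; column names are hypothetical
# keep only the highest-numbered instance of each attribute
acxiom_df = (acxiom_df
             .sort_values('instance_number')
             .drop_duplicates(subset='attribute', keep='last'))
# strip the hierarchy prefixes, keeping only the final (leaf) label
acxiom_df['attribute'] = (acxiom_df['attribute']
                          .str.split('→').str[-1]
                          .str.strip())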
Epsilon presented different data prep issues. A few empty records were found, caused by stray newline characters in preceding rows. Also, some rows contained abbreviations that needed expanding (for example, expanding "CMV" to "Current Market Value"). Some abbreviations were deleted when their meaning or purpose was not obvious (for example, "MT -", which appeared to be an Epsilon internal code of some sort).
As discussed more fully in the Epsilon section below, the vast majority of their inferences focused on household data at the US ZIP Code or US Census geography levels. As a result, most element descriptions were prefixed with "Zip", "Zip2", "Census", or similar qualifiers.


To focus on unique features in my topic models, I stripped the geographic qualifiers from the feature descriptions and then dropped the resulting duplicate feature entries.
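In pandas terms, that step looked roughly like the following (the regular expression and column name are illustrative; the full qualifier list is in the Epsilon notebook):
# rough sketch: remove leading geographic qualifiers, then de-duplicate
epsilon_df['description'] = (epsilon_df['description']
                             .str.replace(r'^(?i:zip2?|census)\s+', '', regex=True)
                             .str.strip())
epsilon_df = epsilon_df.drop_duplicates(subset='description')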
Like Acxiom's, Oracle's data was presented in a hierarchical structure. The highest level in the hierarchy represented the source from which Oracle procured the data (either Oracle's own internal sources or one of Oracle's data partners). Unlike Acxiom's, Oracle's hierarchies sometimes went several levels deep:


My plan was to simply drop the higher-level qualifiers to avoid having their repeating values flood the topic models. At the other end of the spectrum, I also decided to drop the lowest-level descriptors from those data elements with several levels of hierarchy (some had 17 levels!). My assumption was that the lowest, most detailed levels would probably occur infrequently and be relegated to low-relevancy topics by the topic models.
With that plan in mind, the challenge became identifying the appropriate intermediate levels to use across all 8,200+ of Oracle's data records to capture enough meaning to generate usable topic models. After some exploratory data analysis and manual review, it turned out that using the third and fourth levels of hierarchy for each record provided enough differentiation for the models to yield meaningful topics.
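A simplified version of that step is below. The delimiter and the source column name are placeholders; the repo's Oracle notebook builds a column it calls levels_2_and_3 using zero-based level indexes, which correspond to the third and fourth levels described above.
# rough sketch: keep only the third and fourth hierarchy levels of each record
def middle_levels(path, sep=' > ', start=2, stop=4):
    levels = [p.strip() for p in path.split(sep)]
    return ' '.join(levels[start:stop])   # zero-based, so indexes 2 and 3 = third and fourth levels
oracle_df['levels_2_and_3'] = oracle_df['hierarchy_path'].apply(middle_levels)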

To keep the focus of this article on topic modelling, I omit the code for the EDA and data cleansing steps I performed for each aggregator's data. Instead, I focus on the code and results for the various topic models I used. You can find the code I used for EDA and data cleansing at the repo for this article, with a separate Jupyter notebook for each vendor's data.
Modelling Approach
If you search for examples of topic modelling, the examples you find often focus on longer-form text such as paragraphs or short-form text such as X tweets. Unlike those types of datasets, the data elements I was trying to analyze here did not contain full sentences or even complete phrases. They were strings of words used to label data, containing jargon and some industry-specific terms. One of my goals was to see how topic modelling performs on this kind of data.
I share here the results of trying several different topic modelling tools on my data, including traditional approaches such as LDA and NMF (built on TF-IDF features) and newer transformer-based models such as BERTopic. As with my prior article, no permission was needed for me to use the data, as it is my personal data.
I relied heavily on the article by Nicolo Cosimo Albanese for guidance on models to try on my data from the three vendors. It does a superb job of discussing the pros and cons of each type of model for different use cases, and its code examples are excellent.
I limited the data used for topic modelling to the extensive inference data elements that these three aggregators store in their profiles of me. I excluded the non-inference data elements that each vendor believes to be ground truth. The non-inference data elements often relate to personal identifiers and characteristics such as name, phone numbers, addresses, e-mail addresses, race, ethnicity, education, religion, and income/wealth indicators:

While interesting, the non-inference data elements are far outnumbered by the inference data elements in each aggregator's data. I felt the inference elements paint a better picture of the things that are most interesting to the aggregators and their end-customer base.
Throughout this article, I display only the data element descriptions from each aggregator. To preserve my own personal privacy, I do not show the values that the aggregators assign to me for each element. Though I did not do a thorough analysis of the accuracy of those assigned values, my rough estimate is that accuracy exceeds 80% across non-inference and inference data elements.
Acxiom
My visual review of the cleaned and pre-processed Acxiom data revealed that most of the observations related to automobiles and other vehicles. Yet the remaining, non-vehicle-related observations were still interesting; they covered things like inferences about my financial behavior. I did not want those topics to be overwhelmed by the vehicle-related inferences in my topic model, so I created two separate topic models: one for the vehicle inferences and one for the non-vehicle inferences.
Acxiom Vehicle-related Inferences
As a first step, I tried a Non-Negative Matrix Factorization (NMF) model with Acxiom's vehicle-related inferences. Like other traditional models, NMF requires some additional pre-processing of the text to ensure good results. Specifically, common noise words ("stopwords") that don't contribute to the topic model must be removed, as must punctuation.
You also need to specify a range ("n-gram range") for the length of the adjacent-word phrases used to build the model. Here, I used phrases of three to five words, and I used the scikit-learn CountVectorizer object to find the 3-to-5-gram phrases occurring most frequently in the vehicle-related inference data.
# topic analysis code and comments borrowed from https://www.anaconda.com/blog/introduction-to-text-analysis-with-python-in-excel
# some preprocessing steps modelled after https://towardsdatascience.com/elegant-text-pre-processing-with-nltk-in-sklearn-pipeline-d6fe18b91eb8
# code assumes the list of phrases to be analyzed is in a pandas series
# called veh_inf_no_dups_veh (see the Git repo mentioned earlier in this article
# for the code used to build this series)
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
from sklearn.feature_extraction.text import CountVectorizer
# set range for n-grams to be created from the text
ngram_range = (3, 5)
# set options and format text for graph of topic model based on n-gram range
color_spec = 'c'
if ngram_range[0] == 2:
    ngram_txt_1 = 'Bi-grams'
elif ngram_range[0] == 3:
    ngram_txt_1 = 'Tri-grams'
else:
    ngram_txt_1 = f'{ngram_range[0]}-grams'
if ngram_range[1] == 3:
    ngram_txt_2 = 'Tri-grams'
else:
    ngram_txt_2 = f'{ngram_range[1]}-grams'
plot_title_text = ('Inferences - Collected/Processed Data\n' +
                   'Vehicle-related Features\n' +
                   ngram_txt_1 + ' - ' + ngram_txt_2)
# find and display the highest occurring n-grams in the n-gram range
# Create CountVectorizer object
# CountVectorizer converts a collection of text documents to a matrix of token counts
c_vec = CountVectorizer(stop_words=stopwords_list, ngram_range=ngram_range)
# Fit the CountVectorizer to the data to get a matrix of ngrams
ngrams = c_vec.fit_transform(veh_inf_no_dups_veh)
# Count the frequency of ngrams
count_values = ngrams.toarray().sum(axis=0)
# Get the mapping of n-grams to column indices
vocab = c_vec.vocabulary_
# Create a DataFrame to store the frequency and n-gram, sorted in descending order of frequency
df_ngram = pd.DataFrame(
    sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True)
).rename(columns={0: 'frequency', 1: ngram_txt_1 + ' - ' + ngram_txt_2})
# Display the top 20 n-grams by frequency
df_ngram.head(n=20)

Next I used the TfidfVectorizer object with the NMF object to build and visualize the topic model:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import make_pipeline
# TF-IDF (Term Frequency-Inverse Document Frequency) is a technique used to quantify a word in documents
# It is used to reflect how important a word is to a document in a collection,
# or corpus, of documents
# Create a TF-IDF vectorizer object
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords_list, ngram_range=ngram_range)
# Create an NMF (Non-Negative Matrix Factorization) object, allowing
# us to group major topics and sub-topics
# The n_components parameter is used to specify the number
# of major topics to extract
nmf = NMF(n_components=3)
# Create a pipeline object that sequentially applies the TF-IDF vectorizer and NMF
pipe = make_pipeline(tfidf_vectorizer, nmf)
# Fit the pipeline to the data
pipe.fit(veh_inf_no_dups_veh)
# set options to allow for wrapping the sub-topic labels next to the bars
# in the graph
# text wrapping technique from: https://medium.com/dunder-data/automatically-wrap-graph-labels-in-matplotlib-and-seaborn-a48740bc9ce
import textwrap

def wrap_labels(ax, width, break_long_words=False):
    # text wrap the topic labels for use in the bar chart
    labels = []
    for label in ax.get_yticklabels():
        text = label.get_text()
        labels.append(textwrap.fill(text, width=width,
                                     break_long_words=break_long_words))
    ax.set_yticklabels(labels, rotation=0)

def plot_top_words(model, feature_names, n_top_words, title):
    """
    Plot top words in topics
    source: https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
    """
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(1, 3, figsize=(30, 15), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]
        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.6, color=color_spec)
        ax.set_title(f"Topic {topic_idx + 1}", fontdict={"fontsize": 30})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)
        wrap_labels(ax, 15)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)
    fig.suptitle(title, fontsize=40)
    plt.subplots_adjust(top=0.8, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()

# Plot the top words in the topics identified by the NMF model
plot_top_words(
    nmf, tfidf_vectorizer.get_feature_names_out(), 7, plot_title_text
)

From this I see that the first major topic ("Topic 1") relates to inferences about my "rank" as an owner of various vehicle models. I assume "rank" is an Acxiom-derived measure of how likely it is that I own particular models.
The second major topic covers inferences related to new vehicles for which I may be in the market. The third major topic relates to inferences about my rank relative to a vehicle make (as opposed to vehicle model in Topic 1).
I assumed "make" described a type of car (sedan, RV, SUV, etc.) However a look at sub-topics further down in the weighted list for Topic 3 showed specific manufacturers (for example, Nissan and Toyota), not vehicle types. Manufacturers also appeared in the lower sub-topics of Topic 1. It was not clear to me how Acxiom differentiates vehicle "models" versus "makes" versus manufacturers.
As an experiment, I tried running the topic model with different n-gram range values. Results changed, but not drastically. For example, here are the highest occurring n-grams and the topic model for a bi-gram-to-tri-gram range:


The earlier tri-gram-to-5-gram model allowed for a bit more specificity in sub-topics compared to the bi-gram-to-tri-gram results above. Car maker names (BMW, Chevrolet) appeared more frequently in the tri-gram-to-5-gram model.
Interestingly, the third major topic, Topic 3, changed when the different n-gram ranges were used. My rank relative to a vehicle make in Topic 3 of the tri-gram-to-5-gram model is replaced by my loyalty to vehicle market segments in the bi-gram-to-tri-gram model.
The lesson I took from this is that it pays to try various combinations of n-gram ranges when doing topic summarization on a large body of text. You are likely to get varying results that can lead to useful insights and alternative paths to explore in your research.
Acxiom Inferences – Non-vehicle
Moving on to Acxiom's non-vehicle inferences about me, I looked at the results for a bi-gram-to-tri-gram range model:


The first major topic group in this model focused on individual and household buying behavior and spend. The second major topic group covered household income and assets. And finally, topic three covered credit and credit card information.
Note that you can change the number of topics that the NMF model tries to find in your data. You are not limited to three topics. Changing the n_components parameter in the nmf = NMF(n_components=3) statement in the code adjusts the number of topics found. Experimenting with this value may give you different insights into the topics in your data.
Acxiom Summary
It is worth reiterating that the above analysis is based on the topic features that Acxiom included in its inferences data. It does not include the values Acxiom assigned to me for those features. Those values were in a separate column of the report that Acxiom sent to me. The number scales in the bar charts above represent the relative importance of each topic feature in the corpus.
Acxiom's overwhelming focus is inferences about the type of automobiles I own, prefer, or may be in the market to buy. This tells me that automobile manufacturers are a significant part of Acxiom's client and revenue base.
Epsilon
There were a total of 2,549 inference data points in Epsilon's report. My exploratory data analysis revealed that 2,428 or 95% of the inferences were related to the geographic area in which I live. Those geographic-based inferences were at the Zip Code, Census Tract, and other geographic levels and provided aggregated data about all people living in those geographies, not just data specific to me. The remaining 121 inferences were about me individually.
As a first topic model for Epsilon, I decided to continue using NMF with the same code shown above for Acxiom. This time, however, I experimented more with the hyperparameter tuning ideas I mentioned at the end of the Acxiom section. Specifically, I used different values for the number of requested topics, ranging from 3 through 8, and I tried different n-gram ranges, from bi-gram-to-tri-gram up to 4-gram-to-7-gram.
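The sweep itself is a simple pair of loops. Here is a sketch of the approach; epsilon_descriptions is a placeholder name for the cleaned Epsilon series, and stopwords_list is the NLTK list defined in the Acxiom code above.
# sketch of the parameter sweep over n-gram ranges and topic counts
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
for ngram_range in [(2, 3), (3, 5), (4, 7)]:
    for n_topics in range(3, 9):
        vec = TfidfVectorizer(stop_words=stopwords_list, ngram_range=ngram_range)
        X = vec.fit_transform(epsilon_descriptions)
        model = NMF(n_components=n_topics, random_state=42).fit(X)
        terms = vec.get_feature_names_out()
        print(f"\nngram_range={ngram_range}, n_topics={n_topics}")
        for idx, comp in enumerate(model.components_):
            top = [terms[i] for i in comp.argsort()[:-6:-1]]   # top 5 phrases per topic
            print(f"  Topic {idx + 1}: {', '.join(top)}")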
I used visual inspection of the topic results to determine which parameter combination generated the most sensible and understandable topics. As João Pedro helpfully describes in his article, there are also several statistics-based metrics you can use to measure the coherence of the topics created by the model and parameter combinations you test.
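I stuck with visual inspection here, but a coherence score is straightforward to compute. A minimal sketch using gensim's CoherenceModel on a unigram NMF run (my own addition, not João Pedro's code; unigrams keep the topic words aligned with the tokenized texts, and epsilon_descriptions is again a placeholder):
# hedged sketch: score c_v topic coherence for a unigram NMF run
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
vec = TfidfVectorizer(stop_words='english')          # unigrams only
X = vec.fit_transform(epsilon_descriptions)
model = NMF(n_components=8, random_state=42).fit(X)
# tokenize the texts with the same analyzer the vectorizer used,
# so every topic word is guaranteed to appear in the dictionary
analyzer = vec.build_analyzer()
texts = [analyzer(doc) for doc in epsilon_descriptions]
terms = vec.get_feature_names_out()
topic_words = [[terms[i] for i in comp.argsort()[:-11:-1]] for comp in model.components_]
cm = CoherenceModel(topics=topic_words, texts=texts,
                    dictionary=Dictionary(texts), coherence='c_v')
print(f"Mean c_v coherence: {cm.get_coherence():.3f}")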
My intuition was that the larger n-gram ranges would give me the best results. That proved to be incorrect. The bi-gram-to-tri-gram model did a much better job of creating coherent, differentiated topic groups than the 4-gram-to-7-gram model, and this held true regardless of the number of topics requested.
Here, for example, are the topic groups created by the 4-gram-to-7-gram model when I requested 8 topics:
Topic #0 in this list is clean, with all elements (Sub-topic Text) related to characteristics of housing units. Topics #1 and #2, however, contain a mix of elements: both have elements about household income as well as elements about household truck purchases and registrations.
Topics #3 and #6 are each internally clean, but both contain elements related to the current market value of vehicles. Having those elements combined into a single topic focused on current market value of vehicles would have been preferable.
In contrast, here are the topic groups created by the bi-gram-to-tri-gram model when I requested 8 topics:
With bi-grams and tri-grams, all but one topic (#2) are single-subject topics, with all elements related to that single subject. Topic #2 covers both vehicle current market value and vehicle age. That said, one could argue that vehicle age is a key factor in determining a vehicle's current market value and, hence, is a valid component of a current market value topic. (Okay, maybe that is a stretch.)
The bi-gram-to-tri-gram model clearly performed better than the models with larger n-gram ranges, and this held true regardless of the number of topics I asked the NMF model to create.
Next, I tried a Latent Dirichlet Allocation (LDA) topic model. It is a traditional model that does not maintain the semantic context of word phrases. Like NMF, it requires you to specify the number of topics you want it to extract from your corpus, and it requires several data preparation steps.
Below is the code (based on the example in Albanese's article) that I used to prepare the data, fit the model, and visualize the results using the very helpful pyLDAvis library. Note that this code uses lemmatization, versus just stemming, to create the word tokens used by the model.
Lemmatization attempts to identify and retain the correct part of speech for words that have a common stem but which can act as different parts of speech depending on context. For example, "meeting" is a noun in "I had a good meeting". But, it is a verb in "I am meeting with you". Both forms of meeting have a common stem, "meet".
'''
Code here is from Nicolo Cosimo Albanese's article:
https://towardsdatascience.com/topic-modeling-with-lsa-plsa-lda-nmf-bertopic-top2vec-a-comparison-5e6ce4b1e4a5
References:
[1] LDA with Gensim: https://radimrehurek.com/gensim/models/ldamodel.html
[2] Visualization with pyLDAvis: https://pypi.org/project/pyLDAvis/
'''
# Import dependencies
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
import spacy
import pyLDAvis
import pyLDAvis.gensim_models
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

def lemmatize(docs, allowed_postags=["NOUN", "ADJ", "VERB", "ADV"]):
    '''
    Performs lemmatization of input documents.
    Args:
      - docs: list of strings with input documents
      - allowed_postags: list of accepted Part of Speech (POS) types
    Output:
      - list of strings with lemmatized input
    '''
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    lemmatized_docs = []
    for doc in docs:
        doc = nlp(doc)
        tokens = []
        for token in doc:
            if token.pos_ in allowed_postags:
                tokens.append(token.lemma_)
        lemmatized_docs.append(" ".join(tokens))
    return lemmatized_docs

def tokenize(docs):
    '''
    Performs tokenization of input documents.
    Args:
      - docs: list of strings with input documents
    Output:
      - list of token lists built from the input documents
    '''
    tokenized_docs = []
    for doc in docs:
        tokens = gensim.utils.simple_preprocess(doc, deacc=True)
        tokenized_docs.append(tokens)
    return tokenized_docs

# Load the cleaned element descriptions (replace with the path to your pickled series)
docs = pd.read_pickle("path_to_your_data.pkl").to_list()
# Pre-process input: lemmatization and tokenization
lemmatized_docs = lemmatize(docs)
tokenized_docs = tokenize(lemmatized_docs)
# Mapping from word IDs to words
id2word = corpora.Dictionary(tokenized_docs)
# Prepare Document-Term Matrix
corpus = []
for doc in tokenized_docs:
    corpus.append(id2word.doc2bow(doc))
# Fit LDA model: See [1] for more details
topic_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,        # Document-Term Matrix
    id2word=id2word,      # Map word IDs to words
    num_topics=30,        # Number of latent topics to extract
    random_state=100,
    passes=100,           # Number of passes through the corpus during training
)
# Visualize with pyLDAvis: See [2] for more details
# pyLDAvis.enable_notebook()
visualization = pyLDAvis.gensim_models.prepare(
    topic_model,
    corpus,
    id2word,
    mds="mmds",
    R=30)
pyLDAvis.save_html(visualization, 'LDA_viz.html')
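If you are working in a Jupyter notebook, you can also render the interactive visualization inline rather than only saving it to an HTML file:
# optional: render the pyLDAvis visualization inline in a notebook
pyLDAvis.enable_notebook()
pyLDAvis.display(visualization)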
The visualizations created by pyLDAvis are interactive, allowing you to explore the words included in each topic as well as relationships among topics. Here, for example, are the words contributing the most to topic 1 of my LDA model for the Epsilon data:

By hovering over topic 1, I see that it focuses heavily on automobile-related words ("truck", "car", "luxury", "register"). However, it also contains some unrelated words ("household", "health", "card").
Topic 2 has a different focus:

This topic is cleaner, relating to personal financial matters. "Trade" is an industry term for a credit account (a tradeline). "Balance", "finance", "bankcard", "installment", and the like continue the personal finance theme.
This visualization also allows us to see how topics relate to one another. For example, here is the result if I hover over the word "trade" in the word list bar chart on the right side:

This tells me that "trade" plays a role in eight topics, with some topics overlapping (2,5,6,10 and 13). It seems most of the topics in the lower right quadrant of the inter-topic distance map relate to personal financial matters. Hovering over the individual topics in that quadrant allowed me to identify the personal financial matters covered by Epsilon.
Epsilon Summary
The vast majority of Epsilon's inferences were at a geographic level rather than an individual level. That said, their inferences provided an amazing number of insights about my behavior and that of my neighbors, and I found those insights to be very accurate.
Automobile preferences tended to be a strong area of analysis for Epsilon, just as they were for Acxiom. Epsilon also included many features related to financial health, household composition, and occupation.
My parameter tuning tests showed that a bi-gram-to-tri-gram range provided the cleanest topic results from my NMF model. Though I did not use n-grams in the LDA model, adding an n-gram preprocessing step likely would have improved the performance of that model. (See this article on optimizing LDA results.)
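If I were to revisit the LDA model, a hedged sketch of that n-gram step using gensim's Phrases model, applied before building the dictionary and corpus, might look like this (not something I ran for this article):
# hedged sketch: merge frequently co-occurring tokens into bigram tokens before fitting LDA
import gensim.corpora as corpora
from gensim.models.phrases import Phrases, Phraser
bigram = Phraser(Phrases(tokenized_docs, min_count=5, threshold=10))
tokenized_docs_bigrams = [bigram[doc] for doc in tokenized_docs]
# rebuild the dictionary and document-term matrix from the bigram-augmented tokens
id2word = corpora.Dictionary(tokenized_docs_bigrams)
corpus = [id2word.doc2bow(doc) for doc in tokenized_docs_bigrams]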
Oracle
After the pre-processing described in the Data Pre-processing section earlier, my file from Oracle contained 7,723 inference records; no further preparation was required.
For Oracle, I switched to a newer, transformer-based model, BERTopic, to perform topic modelling. It uses embedding techniques to maintain the intended meaning and context of word phrases. My intuition was that this should improve the quality of the topics identified by the model.
Here is the code I used:
# based on Nicolo Albanese's BERTopic code example
# in https://towardsdatascience.com/topic-modeling-with-lsa-plsa-lda-nmf-bertopic-top2vec-a-comparison-5e6ce4b1e4a5
import pandas as pd
import numpy as np
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from hdbscan import HDBSCAN
from contextlib import chdir   # contextlib.chdir requires Python 3.11+

with chdir("XXXXXX"):   # path to your pre-processed data
    docsdf = pd.read_pickle("YYYY.pkl")   # data file name - mine contained a
                                          # pandas series object stored in pickle format
                                          # see Oracle pre-processing notebook for data details
docs = docsdf['levels_2_and_3'].to_list()

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
cluster_model = HDBSCAN(min_cluster_size=25,
                        min_samples=10,
                        metric='euclidean',
                        cluster_selection_method='eom',
                        prediction_data=True)
# BERTopic model
topic_model = BERTopic(embedding_model=embedding_model,
                       hdbscan_model=cluster_model,
                       n_gram_range=(2, 3),
                       nr_topics='auto')
# Fit the model on the corpus
topics, probs = topic_model.fit_transform(docs)
df_top_topics = topic_model.get_topic_info().set_index('Topic')[
    ['Count', 'Name', 'Representation']]

with chdir("ZZZZ"):   # path where you want to save the top topics data and visualizations
    # save top topics data
    df_top_topics.to_pickle('Oracle_top_topics_df_clus_25_minsamp_10_ngram_2_3.pkl')
    # Save intertopic distance map as HTML file
    topic_model.visualize_topics().write_html("Oracle_intertopic_dist_map_clus_25_minsamp_10_ngram_2_3.html")
    # Save topic-terms barcharts as HTML file
    topic_model.visualize_barchart(top_n_topics=25).write_html("Oracle_barchart_clus_25_minsamp_10_ngram_2_3.html")
    # Save documents projection as HTML file
    topic_model.visualize_documents(docs).write_html("Oracle_projections_clus_25_minsamp_10_ngram_2_3.html")
    # Save topics dendrogram as HTML file
    topic_model.visualize_hierarchy().write_html("Oracle_hierarchy_clus_25_minsamp_10_ngram_2_3.html")
As BERTopic developer Maarten Grootendorst describes in his BERTopic tutorial, several hyperparameters can be tuned for BERTopic and for the helper modules it uses, HDBSCAN and UMAP. An explanation of the parameters and their usage can be found in the BERTopic GitHub repo.
I experimented with the n_gram_range and nr_topics parameters of BERTopic and the min_cluster_size and min_samples parameters of HDBSCAN to see which settings produced the most meaningful topic results. As in the Epsilon tests, n_gram_range determined the size of adjacent word phrases that were used in the model.
The nr_topics parameter allows you to specify the number of topics that you want BERTopic to create. Its default is None, which causes BERTopic to determine the number of topics on its own. If you specify a number, BERTopic will combine similar topics to reach that target number of topics. You can also specify 'auto', which reduces the number of topics created by iteratively combining topics that meet a similarity metric.
The min_cluster_size of HDBSCAN determines the minimum number of documents to be included in a topic cluster. The min_samples parameter is set to the value of min_cluster_size by default and works with min_cluster_size to determine the number of outliers. Grootendorst suggests that setting min_samples less than min_cluster_size can reduce the number of outliers.
Traditional models such as those I used for Acxiom and Epsilon will not create outlier documents. Every document will be placed into a topic. Conversely, BERTopic does identify outliers that do not fit into any of the topics it creates. If you create too many outliers, you may miss insights from documents that could be included in topic clusters. Creating too few outliers may cause documents to be forced into unrelated topic clusters.
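Recent versions of BERTopic also provide a way to re-assign outlier documents after fitting. I did not use it in this analysis, but a hedged sketch, reusing the topic_model, docs, and topics objects from the code above, would look something like this:
# hedged sketch, not used in my analysis: re-assign outlier documents (topic -1)
# to their closest topic after the model has been fit
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
topic_model.update_topics(docs, topics=new_topics)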
Here is a summary of the results from the hyperparameter tests I ran:

Interestingly, the default parameter settings generated the fewest outliers. They did, however, create the most topics. Here is a bar chart of the top 25 topics created by the default-parameter model:

Given that single words were used to create the tokens and embeddings, I was not surprised that outliers were minimized. It should be easier to find similarity among single-word topics.
A drawback of this model's results is that some topics are not as coherent and understandable as others. For example, it is hard to determine the overall theme of Topic 10 ("healthcare", "intent", "b2b", "insurance") or Topic 18 ("buyer", "employed", "employment", "donations", "intent"). Conversely, Topic 6 ("computers", "electronics", "technology", "computing", "computing", "cell") appears to focus on an individual's technology interests and Topic 8 ("garden", "home", "improvement", "decor", "diy") focuses on home improvement interests.
Also, there is clear opportunity to combine topics created by the model. Topic 20 ("financial", "retirement", "banking", "services", "balance") sounds similar enough to Topic 22 ("liquid", "assets", "investing", "finance", "investors") that the two topics could be combined.
I assumed that specifying an n-gram range would lead to more meaningful, coherent topics, as it did in my Epsilon tests. As the above summary table shows, using bigrams and trigrams along with asking BERTopic to combine similar output topics (nr_topics = 'auto') did cause the number of topics created to drop significantly. However, the number of outliers ballooned.
Taking Grootendorst's suggestion to set min_samples less than min_cluster_size did reduce the number of outliers. But, the number of topics increased.
I reached a happy medium by specifying a number of topics to create (40), setting min_samples less than min_cluster_size (10 and 25, respectively), and using bigrams and trigrams. Thirty-nine topics were created (topic number "-1" holds the outliers), and an acceptable total of 664 outliers were identified.
Here are the top 25 topics:

Judge for yourself, but I think there are fewer ambiguous topics and fewer opportunities to combine similar topics than in the results from the default-parameter model. Also, the longer word descriptions produced by the bigrams and trigrams give you more of a clue about the focus of each topic, aiding understandability.
Arguably, topic 9 ("to buy", "ready to buy", "ready to", "buy travel", "to buy travel") and topic 16 ("travel intenders", "insights discretionary spend", "financial insights discretionary", "insights discretionary", "discretionary spend") each contain mixed topics. They present an opportunity to combine their mixed topic components (spending profile and interest in travel) into two, cleaner topics. But that still leaves us with two topics, not a reduction in topics.
Oracle Summary
With over 7,000 records, Oracle had by far the largest amount of data on me. It would be a daunting task to summarize that data into meaningful topics manually.
The BERTopic transformer model I used for Oracle did a very good job of identifying topics in the data, even when used with default parameters. It creates outliers, but I believe outliers help ensure the topics created are cleaner and more coherent.
As with the Epsilon tests, testing BERTopic with different hyperparameter settings led to improved results. The bigram and trigram model along with setting a specific number of desired output topics and adjusting parameters to reduce outliers produced excellent topic results.
Oracle's inferences addressed a wide range of aspects about me and my lifestyle. Automobile preferences were included as were personal finance data and technology interests. But there were plenty of other inferences about such diverse things as characteristics of my employers, my charitable giving profile, my fitness product interests, my desire to travel, and my interest in do-it-yourself home improvement projects. That tells me that Oracle has clients for its data in a wide variety of industries.
Conclusion
The motivation for my work was to get a grasp on the types of personal data focused on by the three aggregators I chose. As indicated in the summary for each vendor, a wide range of data points is stored for each individual, and some relate to people in your neighborhood, not just you. As you likely noted from the output of the various topic models above, the inferences that these three aggregators store and sell are focused on helping their clients build profiles of individuals, profiles that in turn allow those clients to be more successful at selling their goods and services to consumers. There is no real surprise in that.
One thing that did surprise me was how much one industry was represented in the vendors' data. The automotive industry apparently has a voracious appetite for data on our propensity to buy their cars and trucks. That appetite must lead to a willingness to pay a lot of money for personal inferences information, and the aggregators have responded.
Another motivation for my work was to determine whether traditional and newer topic modelling tools could be used effectively to summarize topics from lists of personal data attributes, lists that are different from sets of short sentences such as X tweets and different from collections of paragraphs. My intuition was that topic identification might be less effective on personal data attributes since they are not necessarily complete thoughts and can contain a significant number of business-specific terms and abbreviations.
As it turned out, my intuition was incorrect. The models I tested were able to successfully generate topics from the data. The effectiveness of the models varied. It became clear that testing different hyperparameter combinations is an essential step to finding the best topic model for personal data attributes, just as I am sure it is for other types of text data.
In my assessment, the transformer-based BERTopic model showed the best performance at creating coherent, meaningful topics from my personal attribute data. It was also the easiest to use. That said, my analysis was by no means a scientific one: I used a different aggregator's dataset in each test of the different models.