Topic Alignment for NLP Recommender Systems
Aligning the topics of a query with the topics of the underlying documents in a Natural Language Processing (NLP)-based system.

Introduction
As the capabilities of Large Language Models (LLMs), such as ChatGPT and Llama, continue to grow, a growing area of research revolves around adapting semantic reasoning to these systems. While these models do a great job of providing responses grounded in predictions based on prior human knowledge, issues with hallucinations, generic answers, and answers that don't fulfill the user's request are still common. Recommendation systems parallel LLMs in how they provide recommendations based on user input. Today, we will look at how recommendation responses improve when we add metadata about the topics within a query and measure how those topics align with the data used to create a response.
This research is important because it could ultimately lead to enhancements in the semantic depth of large language models (LLMs) by incorporating human-like abilities to infer overarching topics inherent in a body of information.
Topic Modeling Overview
Let's do a quick refresher on Topic Modeling.
Topic modeling is a machine learning technique used to identify and extract hidden themes or topics from large collections of text data.
The basic steps of Topic Modeling include:
- Collect data (textual in nature; a collection of documents is known as a corpus)
- Use general Natural Language Processing (NLP) practices to preprocess the data (Tokenize, lowercase, remove punctuation and stopwords, lemmatization, etc.)
- Word-Document Matrix Creation – A matrix is built in which each row represents a document, each column represents a word, and each entry is the frequency of that word within that document.
- Topic Identification – An algorithm like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) is applied to detect patterns in word usage and group frequently co-occurring words into topics.
- Document-Topic Distribution – Documents are represented as a combination of topics, and weights are assigned based on the level of contribution from each topic to a given document.
- Interpretation – Topics inferred from calculated weights.

The image above illustrates the general process. We apply algorithms such as LDA or NMF to identify the underlying topics within a corpus. Using these topics, we can cluster documents based on their thematic similarity. Additionally, inference can be used to derive deeper insights by analyzing the relationships between the topics and their respective weights within each document, revealing the overarching meaning within the corpus.
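To make the process above concrete, here is a minimal sketch that fits an LDA model on a toy corpus with scikit-learn. The corpus, variable names, and choice of two topics are illustrative assumptions, and I use a simple count matrix here; the full system later in this article uses TF-IDF vectors.
#Minimal topic-modeling sketch: toy corpus -> word-document matrix -> LDA -> topic weights
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
corpus = [
    "generative adversarial networks create synthetic images",
    "topic models uncover hidden themes in text corpora",
    "image classifiers improve with augmented training data",
]
#Word-document matrix: rows are documents, columns are words, entries are counts
vectorizer = CountVectorizer(stop_words="english")
word_doc_matrix = vectorizer.fit_transform(corpus)
#Topic identification: LDA groups frequently co-occurring words into topics
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic_weights = lda.fit_transform(word_doc_matrix)  #document-topic distribution
#Interpretation: inspect the top words per topic and each document's topic weights
words = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {idx}: {' '.join(top_words)}")
print(doc_topic_weights)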
Incorporating topics offers significant benefits by organizing data, uncovering hidden themes, and enhancing the relevance of information. Relevance is crucial, yet often overlooked, when systems deliver information, particularly in the case of Large Language Models (LLMs). This is due to the current limitations in semantic understanding within these systems. However, I anticipate that advancements in this area will resolve these challenges in the coming years.
Past Work
You can find the original recommendation system article I wrote here. A quick summary of how the code works:
- pdfReader: Extracts text from a PDF as a single line string.
- xlsxReader: Extracts text from an Excel document as a single line string.
- csvReader: Extracts text from a CSV File as a single line string.
- pptReader: Extracts text from a PowerPoint as a single line string.
- wordDocReader: Extracts text from a Word document as a single line string.
- dataprocessor: Designed to process and analyze a collection of documents, perform text preprocessing, transform text into vectors, and recommend documents based on text similarity.
The diagram below explains the workflow of the code and how recommendations are provided to the user.

First, the user creates a database (simply placing files into a folder) of the information they wish to query. I recommend using information you want to archive but may want to reference in the future. The database is then processed using common NLP preprocessing steps and stored. The user then provides a query, and a weighted recommendation is returned. The system combines term frequency, cosine similarity, and distance-based similarity through network analysis to produce its recommendation. The weighted recommendation is as follows:

The weighted output is based on the metrics computed within the system. Based on your preferences, you can change the weights accordingly to produce your tailored final recommendation list.
Differences Between Old and New

The process is the same as before, except that topics are now included, and a similarity score is calculated between the topics uncovered in the query and the topics in the vectorized database.

These weights can be adjusted based on the preference of the user, but make sure they sum to 1! Now, the output will incorporate topic information, which can help deliver more relevant results to the user. A short sketch of how the scores are combined follows; after that is an example output of some of the information we can obtain:
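The weights below mirror the final ranking step in the full code later in this article (0.6 for the blended cosine/topic similarity from the recommender, 0.2 for eigenvector centrality, 0.2 for topic similarity); the score values and variable names are illustrative assumptions.
#Illustrative only: combine the three similarity signals into one weighted score
w_similarity, w_centrality, w_topic = 0.6, 0.2, 0.2  #weights must sum to 1
similarity_score, centrality_score, topic_score = 0.82, 0.45, 0.91  #example values
final_score = (w_similarity * similarity_score
               + w_centrality * centrality_score
               + w_topic * topic_score)
print(final_score)  #0.764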
Example Output
File: 3. McCloskey, Cox, Champagne & Bihl - Benefits of using blended generative adversarial network images to augment classification model training data sets - Copy.pdf
Topic distribution: [0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]
Topic 9: collection need machine look pattern make follow point hyperparameter level
Topic 8: collection need machine look pattern make follow point hyperparameter level
Topic 7: collection need machine look pattern make follow point hyperparameter level
File: Published Research - Extracting Insight from Small Corpora.pdf
Topic distribution: [0.00522037 0.00522037 0.00522404 0.00522037 0.95301298 0.00522037
0.00522037 0.00522037 0.00522037 0.00522037]
Topic 4: review word topic chatgpt corpus research business model customer data
Topic 2: image model dataset training combine object truck car gan set
Topic 9: collection need machine look pattern make follow point hyperparameter level
File: MS_THESIS - Using Generative Adversarial Networks to Augment Unmanned Aerial.pdf
Topic distribution: [0.00619424 0.00619424 0.94424855 0.00619424 0.00619753 0.00619424
0.00619424 0.00619424 0.00619424 0.00619424]
Topic 2: image model dataset training combine object truck car gan set
Topic 4: review word topic chatgpt corpus research business model customer data
Topic 9: collection need machine look pattern make follow point hyperparameter level
Query: generative adversarial network image
Topic distribution: [0.03350991 0.03350991 0.69841058 0.03350991 0.03351012 0.03350991
0.03350991 0.03350991 0.03350991 0.03350991]
Topic 2: image model dataset training combine object truck car gan set
Topic 4: review word topic chatgpt corpus research business model customer data
Topic 9: collection need machine look pattern make follow point hyperparameter level
I uploaded different papers I have published and calculated their topics. You can see at the bottom that the system also calculated the topics of my query, "generative adversarial network image." This adds an additional layer of semantic understanding to the system, and now we are incorporating more background information! This is critical for creating a more tailored recommendation system.
Future Work
- Have a database of topics and then use these as tags for the query and documents in the Retrieval-Augmented Generation (RAG) database. This can help add more stability to the process as well as semantic understanding (see the sketch after this list).
- Use an LLM to create a summary of the topics from the query and the RAG database, and figure out which topics align the most.
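As a rough illustration of the first idea, here is a minimal, purely hypothetical sketch of scoring how well a query's topic tags align with each document's tags; the tag sets, file names, and the Jaccard overlap measure are my own assumptions and are not part of the current code.
#Hypothetical sketch: score query/document alignment by topic-tag overlap (Jaccard)
def tag_overlap(query_tags, doc_tags):
    query_tags, doc_tags = set(query_tags), set(doc_tags)
    if not query_tags or not doc_tags:
        return 0.0
    return len(query_tags & doc_tags) / len(query_tags | doc_tags)
doc_tags = {
    "gan_thesis.pdf": ["generative models", "image augmentation"],  #hypothetical tags
    "small_corpora.pdf": ["topic modeling", "nlp"],
}
query_tags = ["generative models", "image augmentation"]
for name, tags in doc_tags.items():
    print(name, tag_overlap(query_tags, tags))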
Conclusion
Today, we reviewed how Topic Modeling can be used to help align the answers provided by a recommendation engine with a user's query. This approach seeks to add to the system the knowledge-based understanding humans use, incorporating our ability to formulate specific topics about a body of information. This is important for a few reasons.
- While it takes a more macro-level view and provides a metric based on generalizing the information, the generalization is quantifiable and is not a simple yes-or-no decision.
- It helps overcome issues at the micro level, where specific words are compared between two entities and, over time, the meaning behind a body of information is misconstrued.
- The idea of using topics in recommendation systems extends to LLMs and can help us further understand how to adapt semantic meaning to these systems, as well as observe how LLM architectures create meaning from abstract information. This is a growing area of research right now, and I invite you to give it some thought!
If you enjoyed today's article and want to read more, give me a follow! Please feel free to recommend topics for other posts you would like to see. Thanks for reading!
Code
#Imports
import string
import csv
from io import StringIO
from pptx import Presentation
import docx2txt
import PyPDF2
import spacy
import pandas as pd
import numpy as np
import nltk
import re
import openpyxl
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.parsing.preprocessing import STOPWORDS as SW
from sklearn.decomposition import LatentDirichletAllocation
from scipy.sparse import csr_matrix
from nltk.corpus import wordnet
import networkx as nx
from networkx.algorithms.shortest_paths import weighted
import glob
import sys
#Add system path
sys.path.append('your_path here')
#NLTK downloads for preprocessing
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
#PDF Reader Class
class pdfReader:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def PDF_one_pager(self) -> str:
        """Returns a one-line string of the pdf.
        Returns:
            content (str): A one-line string of the pdf.
        """
        content = ""
        p = open(self.file_path, "rb")
        pdf = PyPDF2.PdfReader(p)
        num_pages = len(pdf.pages)
        for i in range(0, num_pages):
            content += pdf.pages[i].extract_text() + "\n"
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        page_number_removal = r"\d{1,3} of \d{1,3}"
        page_number_removal_pattern = re.compile(page_number_removal, re.IGNORECASE)
        content = re.sub(page_number_removal_pattern, '', content)
        return content

    def pdf_reader(self):
        """Reads a .pdf formatted file and returns a python readable pdf.
        Returns:
            read_pdf: A python readable .pdf file.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfReader(opener)
        return read_pdf

    def pdf_info(self) -> dict:
        """Returns an information dictionary of a pdf.
        Returns:
            pdf_info_dict (dict): A dictionary containing the metadata of the object.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfReader(opener)
        pdf_info_dict = {}
        for key, value in read_pdf.metadata.items():
            pdf_info_dict[re.sub('/', "", key)] = value
        return pdf_info_dict

    def pdf_dictionary(self) -> dict:
        """Returns a dictionary of the object where the keys are the pages
        and the text within the pages are the values.
        Returns:
            pdf_dict (dict): A dictionary of pages and text.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfReader(opener)
        length = len(read_pdf.pages)
        pdf_dict = {}
        for i in range(length):
            page = read_pdf.pages[i]
            text = page.extract_text()
            pdf_dict[i] = text
        return pdf_dict
#Excel Reader Class
class xlsxReader:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def xlsx_text(self):
        """Returns a string of the text of an excel document.
        Returns:
            text (str): String of text of a document.
        """
        inputExcelFile = self.file_path
        text = str()
        wb = openpyxl.load_workbook(inputExcelFile)
        for sn in wb.sheetnames:
            excelFile = pd.read_excel(inputExcelFile, engine='openpyxl', sheet_name=sn)
            excelFile.to_csv("ResultCsvFile.csv", index=None, header=True)
            with open("ResultCsvFile.csv", "r") as csvFile:
                lines = csvFile.read().split(",")  # split on "\r\n" if needed
            for val in lines:
                if val != '':
                    text += val + ' '
        text = text.replace('\ufeff', '')
        text = text.replace('\n', ' ')
        return text
#CSV Reader Class
class csvReader:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def csv_text(self):
        """Returns a string of the text of a csv document.
        Returns:
            text (str): String of text of a document.
        """
        text = str()
        with open(self.file_path, "r") as csvFile:
            lines = csvFile.read().split(",")  # split on "\r\n" if needed
        for val in lines:
            text += val + ' '
        text = text.replace('\ufeff', '')
        text = text.replace('\n', ' ')
        return text
#Powerpoint Reader Class
class pptReader:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def ppt_text(self):
        """Returns a string of the text of a Microsoft PowerPoint document.
        Returns:
            text (str): String of text of a document.
        """
        prs = Presentation(self.file_path)
        text = str()
        for slide in prs.slides:
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        text += ' ' + run.text
        return text
#Word Document Reader Class
class wordDocReader:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def word_reader(self):
        """Returns a string of the text of a Microsoft Word document.
        Returns:
            text (str): String of text of a document.
        """
        text = docx2txt.process(self.file_path)
        text = text.replace('\n', ' ')
        text = text.replace('\xa0', ' ')
        text = text.replace('\t', ' ')
        return text
#Data Processing and Recommendation Functions
class dataprocessor:
    def __init__(self):
        return
    @staticmethod
    def get_wordnet_pos(text: str) -> str:
        """Map POS tag to the first character lemmatize() accepts.
        Inputs:
            text (str): A string of text
        Returns:
            The wordnet POS tag for the word (defaults to noun).
        """
        tag = nltk.pos_tag([text])[0][1][0].upper()
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}
        return tag_dict.get(tag, wordnet.NOUN)
    @staticmethod
    def preprocess(text: str):
        """A function that preprocesses text through the
        steps of Natural Language Processing (NLP).
        Inputs:
            text (str): A string of text
        Returns:
            text (str): A processed string of text
        """
        #lowercase
        text = text.lower()
        #punctuation removal
        text = "".join([i for i in text if i not in string.punctuation])
        #digit removal (only for all-numeric tokens)
        text = [x for x in text.split(' ') if x.isnumeric() == False]
        #stopword removal
        stopwords = nltk.corpus.stopwords.words('english')
        custom_stopwords = ['\n', '\n\n', '&', ' ', '.', '-', '$', '@']
        stopwords.extend(custom_stopwords)
        text = [i for i in text if i not in stopwords]
        text = ' '.join(word for word in text)
        #lemmatization
        lm = WordNetLemmatizer()
        text = [lm.lemmatize(word, dataprocessor.get_wordnet_pos(word)) for word in text.split(' ')]
        text = ' '.join(word for word in text)
        text = re.sub(' +', ' ', text)
        return text
    @staticmethod
    def data_reader(list_file_names, file_dict=dict()):
        """A function that reads in the data from a directory of files.
        Inputs:
            list_file_names (list): List of the filepaths in a directory.
        Returns:
            text_list (list): A list where each value is a string of the text
                for each file in the directory
            file_dict (dict): Dictionary where the keys are the filenames and the values
                are the information found within each given file
        """
        text_list = []
        reader = dataprocessor()
        for file in list_file_names:
            temp = file.split('.')
            filetype = temp[-1]
            if filetype == "pdf":
                file_pdf = pdfReader(file)
                text = file_pdf.PDF_one_pager()
            elif filetype == "docx":
                word_doc_reader = wordDocReader(file)
                text = word_doc_reader.word_reader()
            elif filetype == "pptx" or filetype == 'ppt':
                ppt_reader = pptReader(file)
                text = ppt_reader.ppt_text()
            elif filetype == "csv":
                csv_reader = csvReader(file)
                text = csv_reader.csv_text()
            elif filetype == 'xlsx':
                xl_reader = xlsxReader(file)
                text = xl_reader.xlsx_text()
            else:
                print('File type {} not supported!'.format(filetype))
                continue
            text = reader.preprocess(text)
            text_list.append(text)
        # Note: assumes every file in the directory is a supported type so that
        # these indices align with text_list
        for i, file in enumerate(list_file_names):
            file_dict[i] = (file, file.split('/')[-1])
        return text_list, file_dict
    @staticmethod
    def database_processor(file_dict, text_list: list):
        """A function that transforms the text of each file within the
        database into a vector.
        Inputs:
            file_dict (dict): Dictionary where the keys are the filenames and the values
                are the information found within each given file
            text_list (list): A list where each value is a string of the text
                for each file in the directory
        Returns:
            list_dense (list): A list of the files' text turned into vectors.
            vectorizer: The vectorizer used to transform the strings of text
            file_vector_dict (dict): A dictionary where the file names are the keys
                and the vectors of each file's text are the values.
        """
        file_vector_dict = dict()
        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(text_list)
        feature_names = vectorizer.get_feature_names_out()
        matrix = vectors.todense()
        list_dense = matrix.tolist()
        for i in range(len(list_dense)):
            file_vector_dict[file_dict[i][1]] = list_dense[i]
        return list_dense, vectorizer, file_vector_dict
    @staticmethod
    def input_processor(text, vectorizer):
        """A function that accepts a string of text and vectorizes it using a
        TFIDF vectorizer.
        Inputs:
            text (str): A string of text
            vectorizer: A pre-trained TFIDF vectorizer
        Returns:
            text_vector (sparse matrix): The vectorized form of the input text
        """
        # If the input is mistakenly passed as a list, join it into a single string
        if isinstance(text, list):
            text = ' '.join(text)
        # Convert the text to its TF-IDF vectorized form
        text_vector = vectorizer.transform([text])  # The vectorizer expects a list of strings
        return text_vector
    @staticmethod
    def similarity_checker(vector_1, vector_2):
        """A function that accepts two vectors and computes their cosine similarity.
        Inputs:
            vector_1: A numerical vector
            vector_2: A numerical vector
        Returns:
            The cosine similarity score between the two vectors
        """
        # Ensure both vectors are 2-D before computing cosine similarity
        vector_1 = np.atleast_2d(vector_1)
        vector_2 = np.atleast_2d(vector_2)
        return cosine_similarity(vector_1, vector_2)
    @staticmethod
    def recommender(vector_file_list, query_vector, file_dict, lda_model, lda_vectorizer, topic_distributions):
        """Recommender method that accepts vectors instead of raw text."""
        similarity_dict = dict()
        topic_similarity_dict = dict()
        score_dict = dict()
        # Get the topic distribution for the query
        query_topic_dist = dataprocessor.get_topic_distribution(query_vector, lda_model, lda_vectorizer)
        for i, file_vector in enumerate(vector_file_list):
            file_name = file_dict[i][1]
            # Compute cosine similarity between vector representations
            cosine_sim = dataprocessor.similarity_checker(file_vector, query_vector.toarray())  # toarray() converts the csr_matrix
            score_dict[file_name] = cosine_sim[0][0]
            # Compute topic similarity
            doc_topic_dist = topic_distributions[i].reshape(1, -1)
            topic_sim = dataprocessor.topic_similarity(query_topic_dist, doc_topic_dist)
            topic_similarity_dict[file_name] = topic_sim
            # Combine cosine similarity and topic similarity
            similarity_dict[file_name] = 0.7 * cosine_sim[0][0] + 0.3 * topic_sim
        # Keep the top half of the files, sorted by cosine similarity
        recommended = sorted(score_dict.items(), key=lambda x: -x[1])[:int(np.round(.5 * len(similarity_dict)))]
        final_recommendation = [rec[0] for rec in recommended]
        # Return the similarity scores aligned with the recommended file order
        similarity_list = [similarity_dict[name] for name in final_recommendation]
        topic_similarity_list = [topic_similarity_dict[name] for name in final_recommendation]
        return final_recommendation, similarity_list, topic_similarity_list
    @staticmethod
    def ranker(recommendation_val, file_vec_dict):
        """A function that accepts a list of recommendation values and a dictionary
        of the files within the database and their vectors.
        Inputs:
            recommendation_val (list): A list of recommendations found through cosine
                similarity
            file_vec_dict (dict): A dictionary of the filenames as keys and their
                text in vectors as the values.
        Returns:
            ec_recommended (list): The recommendations ranked using the
                eigenvector centrality algorithm.
        """
        my_graph = nx.Graph()
        for i in range(len(recommendation_val)):
            file_1 = recommendation_val[i]
            for j in range(len(recommendation_val)):
                file_2 = recommendation_val[j]
                if i != j:
                    # Calculate the similarity score between the two files (edge weight)
                    edge_dist = cosine_similarity([file_vec_dict[recommendation_val[i]]], [file_vec_dict[recommendation_val[j]]])
                    # Add an edge from file 1 to file 2 with that weight,
                    # skipping edges with very low similarity
                    if edge_dist[0][0] > 0.1:  # threshold to prevent near-zero edges
                        my_graph.add_edge(file_1, file_2, weight=edge_dist[0][0])
        # Check if the graph is empty before calculating eigenvector centrality
        if my_graph.number_of_nodes() == 0:
            # Handle the empty-graph case by returning the original
            # recommendations with a default rank
            print("Warning: Empty graph. Returning original recommendations.")
            return [(rec, 1) for rec in recommendation_val]
        else:
            # Rank the nodes of the graph with eigenvector centrality
            rec = nx.eigenvector_centrality(my_graph, max_iter=1000)  # max_iter raised in case of convergence issues
            # Sort the files by centrality score
            ec_recommended = sorted(rec.items(), key=lambda x: -x[1])[:int(np.round(len(rec)))]
            return ec_recommended
    @staticmethod
    def weighted_final_rank(sim_list, ec_recommended, final_recommendation, topic_sim_list):
        """A function that incorporates topic similarity into the final ranking."""
        final_dict = dict()
        for i in range(len(sim_list)):
            combined_value = ((0.6 * sim_list[final_recommendation.index(ec_recommended[i][0])].squeeze())
                              + (0.2 * ec_recommended[i][1])
                              + (0.2 * topic_sim_list[final_recommendation.index(ec_recommended[i][0])]))
            final_dict[ec_recommended[i][0]] = combined_value
        weighted_final_recommend = sorted(final_dict.items(), key=lambda x: -x[1])[:int(np.round(len(final_dict)))]
        return weighted_final_recommend
    @staticmethod
    def fit_topic_model(documents, n_topics=10):
        """Fit a topic model (LDA) on the documents and return the model and topic distributions."""
        vectorizer = TfidfVectorizer(stop_words='english')
        tfidf_matrix = vectorizer.fit_transform(documents)
        lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
        topic_distributions = lda_model.fit_transform(tfidf_matrix)
        return lda_model, vectorizer, topic_distributions
    @staticmethod
    def get_topic_distribution(text, lda_model, vectorizer):
        """Get the topic distribution for a given text using the LDA model."""
        # Check if text is a csr_matrix and convert to dense if necessary
        if isinstance(text, csr_matrix):
            text = text.toarray()
        # Convert the input to a string so the LDA vectorizer can transform it.
        # (When a vector is passed in, this is only a rough approximation;
        # passing the raw query text gives a cleaner topic distribution.)
        text = str(text)
        text_vector = vectorizer.transform([text])
        topic_distribution = lda_model.transform(text_vector)
        return topic_distribution
    # ... other methods ...
    @staticmethod
    def topic_similarity(query_topic_dist, document_topic_dist):
        """Compute similarity between query topic distribution and document topic distribution."""
        return cosine_similarity(query_topic_dist, document_topic_dist)[0][0]
#Final Code To Get Recommendations
# Define the path to the database
path = 'your path here'
db = [f for f in glob.glob(path + '*')]
# Read documents and create the file dictionary
research_documents, file_dictionary = dataprocessor.data_reader(db)
# Preprocess and vectorize the documents
list_files, vectorizer, file_vec_dict = dataprocessor.database_processor(file_dictionary, research_documents)
# Fit the topic model (LDA)
lda_model, lda_vectorizer, topic_distributions = dataprocessor.fit_topic_model(research_documents)
# Define the query
query = 'generative adversarial network images'
# Preprocess the query
query = dataprocessor.preprocess(query)
# Convert query to vector
query_vector = dataprocessor.input_processor(query, vectorizer)
# Get recommendations using both cosine similarity and topic similarity
recommendation, sim_list, topic_sim_list = dataprocessor.recommender(list_files, query_vector, file_dictionary, lda_model, lda_vectorizer, topic_distributions)
# Rank the recommendations using eigenvector centrality
ec_recommendation = dataprocessor.ranker(recommendation, file_vec_dict)
# Get the final weighted recommendation, incorporating topic similarity
final_weighted_recommended = dataprocessor.weighted_final_rank(sim_list, ec_recommendation, recommendation, topic_sim_list)
# Print the final recommendations
print(final_weighted_recommended)
query = "Generative Adversarial Networks"
Example Output
[('Using Generative Adversarial Networks to Augment Unmanned Aerial.pdf', 0.5214213562373095), (
'Published Research - Extracting Insight from Small Corpora.pdf', 0.27088670405465454)]
Topics?
The following code is useful if you want to know the topics of the documents in your database, as well as the topics of your query:
#Topic Check
def print_document_topics(file_dict, topic_distributions, lda_model, vectorizer, n_top_words=10):
    """Prints the topics for each document along with its filename."""
    # Get the feature names from the vectorizer (i.e., the words in the vocabulary)
    feature_names = vectorizer.get_feature_names_out()
    # Print topics for each document
    for i, doc_topic_dist in enumerate(topic_distributions):
        file_name = file_dict[i][1]
        print(f"\nFile: {file_name}")
        print(f"Topic distribution: {doc_topic_dist}")
        # For each document, print the most prominent topics and their words
        sorted_topics = doc_topic_dist.argsort()[::-1]  # Sort topics in descending order
        for topic_idx in sorted_topics[:3]:  # Show the top 3 topics
            topic_words = [feature_names[i] for i in lda_model.components_[topic_idx].argsort()[:-n_top_words - 1:-1]]
            print(f" Topic {topic_idx}: {' '.join(topic_words)}")

def print_query_topics(query, lda_model, vectorizer, n_top_words=10):
    """Prints the topics for the query."""
    # Vectorize and get the topic distribution for the query
    query_vector = vectorizer.transform([query])
    query_topic_dist = lda_model.transform(query_vector)[0]
    # Get the feature names from the vectorizer
    feature_names = vectorizer.get_feature_names_out()
    print(f"\nQuery: {query}")
    print(f"Topic distribution: {query_topic_dist}")
    # Print the most prominent topics for the query
    sorted_topics = query_topic_dist.argsort()[::-1]  # Sort topics in descending order
    for topic_idx in sorted_topics[:3]:  # Show the top 3 topics
        topic_words = [feature_names[i] for i in lda_model.components_[topic_idx].argsort()[:-n_top_words - 1:-1]]
        print(f" Topic {topic_idx}: {' '.join(topic_words)}")
# Assuming topic_distributions, lda_model, vectorizer, and file_dict have already been defined
# Print topics for each document in the database
print_document_topics(file_dictionary, topic_distributions, lda_model, lda_vectorizer)
# Define your query
query = 'generative adversarial network images'
query = dataprocessor.preprocess(query)
# Print topics for the query
print_query_topics(query, lda_model, lda_vectorizer)
Example Output
File: 3. McCloskey, Cox, Champagne & Bihl - Benefits of using blended generative adversarial network images to augment classification model training data sets - Copy.pdf
Topic distribution: [0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]
Topic 9: collection need machine look pattern make follow point hyperparameter level
Topic 8: collection need machine look pattern make follow point hyperparameter level
Topic 7: collection need machine look pattern make follow point hyperparameter level
File: Published Research - Extracting Insight from Small Corpora.pdf
Topic distribution: [0.00522037 0.00522037 0.00522404 0.00522037 0.95301298 0.00522037
0.00522037 0.00522037 0.00522037 0.00522037]
Topic 4: review word topic chatgpt corpus research business model customer data
Topic 2: image model dataset training combine object truck car gan set
Topic 9: collection need machine look pattern make follow point hyperparameter level
File: MS_THESIS - Using Generative Adversarial Networks to Augment Unmanned Aerial.pdf
Topic distribution: [0.00619424 0.00619424 0.94424855 0.00619424 0.00619753 0.00619424
0.00619424 0.00619424 0.00619424 0.00619424]
Topic 2: image model dataset training combine object truck car gan set
Topic 4: review word topic chatgpt corpus research business model customer data
Topic 9: collection need machine look pattern make follow point hyperparameter level
Query: generative adversarial network image
Topic distribution: [0.03350991 0.03350991 0.69841058 0.03350991 0.03351012 0.03350991
0.03350991 0.03350991 0.03350991 0.03350991]
Topic 2: image model dataset training combine object truck car gan set
Topic 4: review word topic chatgpt corpus research business model customer data
Topic 9: collection need machine look pattern make follow point hyperparameter level