Topic Alignment for NLP Recommender Systems

Author: Murphy

Aligning the topics of a query with the topics of the underlying data to improve recommendations in a Natural Language Processing (NLP)-based system.

Photo by Emmanuel Ikwuegbu on Unsplash

Introduction

As the capabilities of Large Language Models (LLMs) such as ChatGPT and Llama continue to increase, a growing area of research revolves around adapting semantic reasoning to these systems. While these models do a great job providing responses grounded in prior human knowledge, hallucinations, generic answers, and answers that don't fulfill the user's request are still common. Recommendation systems parallel LLMs in how they provide recommendations based on user input. Today, we will look at how recommendation responses can be further enhanced by adding metadata about the topics within a query and how those topics align with the data used to create a response.

This research is important because it could ultimately lead to enhancements in the semantic depth of LLMs by incorporating the human-like ability to infer the overarching topics inherent in a body of information.

Topic Modeling Overview

Let's do a quick refresher on Topic Modeling.

Topic modeling is a machine learning technique used to identify and extract hidden themes or topics from large collections of text data.

The basic steps of Topic Modeling include:

  1. Collect data (textual in nature; a collection of documents is known as a corpus).
  2. Use general Natural Language Processing (NLP) practices to preprocess the data (tokenize, lowercase, remove punctuation and stopwords, lemmatize, etc.).
  3. Word-Document Matrix Creation – A matrix whose rows represent documents and whose columns represent words, where each entry is the frequency of that word within the given document.
  4. Topic Identification – An algorithm like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) is applied to detect patterns in word usage and group frequently co-occurring words into topics.
  5. Document-Topic Distribution – Each document is represented as a combination of topics, with weights assigned based on how much each topic contributes to that document.
  6. Interpretation – Topics are interpreted from the calculated weights.

Figure: Topic Modeling (Created by Author)

The image above illustrates the general process. We apply algorithms such as LDA or NMF to identify the underlying topics within a corpus. Using these topics, we can cluster documents based on their thematic similarity. Additionally, inference can be used to derive deeper insights by analyzing the relationships between the topics and their respective weights within each document, revealing the overarching meaning within the corpus.
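
To make these steps concrete, here is a minimal sketch of the pipeline using scikit-learn. The toy corpus, the number of topics, and the variable names are illustrative placeholders, not the setup used later in this article:

#Minimal topic modeling sketch (illustrative toy example)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "generative adversarial networks create synthetic images",
    "topic models uncover hidden themes in text corpora",
    "image augmentation improves classification model training",
]

#Step 3: build the word-document matrix
vectorizer = CountVectorizer(stop_words='english')
word_doc_matrix = vectorizer.fit_transform(corpus)

#Steps 4-5: fit LDA and get each document's topic distribution
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic_dist = lda.fit_transform(word_doc_matrix)

#Step 6: inspect the top words per topic to interpret the topics
feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-6:-1]]
    print(f"Topic {idx}: {' '.join(top_words)}")
print(doc_topic_dist)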

Incorporating topics offers significant benefits by organizing data, uncovering hidden themes, and enhancing the relevance of information. Relevance is crucial, yet often overlooked, when systems deliver information, particularly in the case of Large Language Models (LLMs). This is due to the current limitations in semantic understanding within these systems. However, I anticipate that advancements in this area will resolve these challenges in the coming years.

Past Work

You can find the original recommendation system article I wrote here. A quick summary of how the code works:

  • pdfReader: Extracts text from a PDF as a single-line string.
  • xlsxReader: Extracts text from an Excel document as a single-line string.
  • csvReader: Extracts text from a CSV file as a single-line string.
  • pptReader: Extracts text from a PowerPoint presentation as a single-line string.
  • wordDocReader: Extracts text from a Word document as a single-line string.
  • dataprocessor: Processes and analyzes the collection of documents, performs text preprocessing, transforms text into vectors, and recommends documents based on text similarity.

The below diagram explains the workflow of the code and how recommendations are provided to the user.

Figure: General Recommendation System Architecture (Created by Author)

First, the user creates a database (simply putting files into a folder) of the information they wish to query. I recommend using information you want to archive but may be interested in using in the future. The database is then processed using common NLP preprocessing and stored. The user then provides a query, and a weighted recommendation is returned. The system uses a combination of term frequency, cosine similarity, and distance similarity through network analysis to produce its recommendation.

The weighted output is based on the metrics computed within the system. Based on your preferences, you can change the weights accordingly to output your tailored final recommended list. A minimal sketch of this weighted combination is shown below.
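
As a rough sketch of that weighted output (the function name and the weights here are placeholders, not the exact values used in the code later in this article), each file's final score is simply a weighted sum of the individual metrics:

#Illustrative weighted recommendation score (weights are placeholders and should sum to 1)
def weighted_score(cosine_sim: float, centrality: float,
                   w_cosine: float = 0.8, w_centrality: float = 0.2) -> float:
    return w_cosine * cosine_sim + w_centrality * centrality

print(weighted_score(cosine_sim=0.62, centrality=0.35))  # 0.566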

Differences Between Old and New

Figure: General Recommendation System Architecture with Topic Modeling (Created by Author)

The process is the same as before, except now topics are included and a similarity score is calculated between the topics uncovered in the query and the topics in the vectorized database.
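
To make the new step concrete, here is a minimal sketch of the topic-alignment calculation, mirroring the topic_similarity helper in the code below. The two distributions are made-up examples; in practice they come from the fitted LDA model:

#Sketch: cosine similarity between a query's and a document's LDA topic distributions
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query_topic_dist = np.array([[0.03, 0.03, 0.70, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.06]])
doc_topic_dist = np.array([[0.01, 0.01, 0.94, 0.01, 0.01, 0.01, 0.00, 0.00, 0.00, 0.01]])

topic_sim = cosine_similarity(query_topic_dist, doc_topic_dist)[0][0]
print(f"Topic similarity: {topic_sim:.3f}")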

The weights on the individual metrics can be adjusted based on the preference of the user, but make sure they sum to 1! Now, the output will incorporate topic information, which can help provide better-tailored information to the user! Here is an example output of some of the information we can obtain:

Example Output

File: 3. McCloskey, Cox, Champagne & Bihl - Benefits of using blended generative adversarial network images to augment classification model training data sets - Copy.pdf
Topic distribution: [0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]
  Topic 9: collection need machine look pattern make follow point hyperparameter level
  Topic 8: collection need machine look pattern make follow point hyperparameter level
  Topic 7: collection need machine look pattern make follow point hyperparameter level

File: Published Research - Extracting Insight from Small Corpora.pdf
Topic distribution: [0.00522037 0.00522037 0.00522404 0.00522037 0.95301298 0.00522037
 0.00522037 0.00522037 0.00522037 0.00522037]
  Topic 4: review word topic chatgpt corpus research business model customer data
  Topic 2: image model dataset training combine object truck car gan set
  Topic 9: collection need machine look pattern make follow point hyperparameter level

File: MS_THESIS - Using Generative Adversarial Networks to Augment Unmanned Aerial.pdf
Topic distribution: [0.00619424 0.00619424 0.94424855 0.00619424 0.00619753 0.00619424
 0.00619424 0.00619424 0.00619424 0.00619424]
  Topic 2: image model dataset training combine object truck car gan set
  Topic 4: review word topic chatgpt corpus research business model customer data
  Topic 9: collection need machine look pattern make follow point hyperparameter level

Query: generative adversarial network image
Topic distribution: [0.03350991 0.03350991 0.69841058 0.03350991 0.03351012 0.03350991
 0.03350991 0.03350991 0.03350991 0.03350991]
  Topic 2: image model dataset training combine object truck car gan set
  Topic 4: review word topic chatgpt corpus research business model customer data
  Topic 9: collection need machine look pattern make follow point hyperparameter level

I uploaded different papers I have published and calculated their topics. You can see at the bottom that the system also calculated the topics of my query, "generative adversarial network image". This adds an additional layer of semantic understanding to the system, and now we are incorporating more background information! This is critical for creating a more tailored recommendation system.

Future Work

  • Have a database of topics and then use these as tags for the query and the documents in the Retrieval Augmented Generation (RAG) database. This can help add more stability to the process as well as semantic understanding.
  • Use an LLM to create a summary of the topics from the query and the RAG database, and figure out which topics align the most.

Conclusion

Today, we reviewed how Topic Modeling can be used to help align the answers a recommendation engine provides to a user. This approach seeks to add the kind of knowledge-based understanding humans use to the system, incorporating our ability to formulate specific topics about a body of information. This is important for a few reasons.

  1. While it takes a more macro-level view and provides a metric based on information generalization, that generalization is quantifiable in its approach and is not a simple yes-or-no decision.
  2. It helps overcome issues at the micro level, where specific words are compared between two entities and, over time, the meaning behind a body of information is misconstrued.
  3. The idea of using topics for recommendation systems is extendable to LLMs, and can help us further understand how to adapt semantic meaning to these systems as well as observe how LLM architectures create meaning from abstract information. This is a growing area of research right now, and I invite you to give it a thought!

If you enjoyed today's article and want to read more, give me a follow! Please feel free to recommend topics for other posts you would like to see. Thanks for reading!

Code

#Imports
import string
import csv
from io import StringIO
from pptx import Presentation
import docx2txt
import PyPDF2
import spacy
import pandas as pd
import numpy as np
import nltk
import re
import openpyxl
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.parsing.preprocessing import STOPWORDS as SW
from sklearn.decomposition import LatentDirichletAllocation
from scipy.sparse import csr_matrix
from nltk.corpus import wordnet
import networkx as nx
from networkx.algorithms.shortest_paths import weighted
import glob
import sys

#Add system path
sys.path.append('your_path here')

#NLTK downloads for preprocessing
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

#PDF Reader Class
class pdfReader:

    def __init__(self, file_path: str) -> str:
        self.file_path = file_path

    def PDF_one_pager(self) -> str:
        """A function returns a one line string of the
            pdf.

            Returns:
            one_page_pdf (str): A one line string of the pdf.

        """
        content = ""
        p = open(self.file_path, "rb")
        pdf = PyPDF2.PdfReader(p)
        num_pages = len(pdf.pages)
        for i in range(0, num_pages):
            content += pdf.pages[i].extract_text() + "\n"
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        page_number_removal = r"\d{1,3} of \d{1,3}"
        page_number_removal_pattern = re.compile(page_number_removal, re.IGNORECASE)
        content = re.sub(page_number_removal_pattern, '', content)

        return content

    def pdf_reader(self) -> str:
        """A function that can read .pdf formatted files
            and returns a python readable pdf.

            Returns:
            read_pdf: A python readable .pdf file.
        """
        opener = open(self.file_path,'rb')
        read_pdf = PyPDF2.PdfReader(opener)

        return read_pdf

    def pdf_info(self) -> dict:
        """A function which returns an information dictionary of a
        pdf.

        Returns:
        dict(pdf_info_dict): A dictionary containing the meta
        data of the object.
        """
        opener = open(self.file_path,'rb')
        read_pdf = PyPDF2.PdfReader(opener)
        pdf_info_dict = {}
        for key,value in read_pdf.metadata.items():
            pdf_info_dict[re.sub('/',"",key)] = value
        return pdf_info_dict

    def pdf_dictionary(self) -> dict:
        """A function which returns a dictionary of
            the object where the keys are the pages
            and the text within the pages are the
            values.

            Returns:
            dict(pdf_dict): A dictionary pages and text.
        """
        opener = open(self.file_path,'rb')

        read_pdf = PyPDF2.PdfReader(opener)
        length = len(read_pdf.pages)
        pdf_dict = {}
        for i in range(length):
            page = read_pdf.pages[i]
            text = page.extract_text()
            pdf_dict[i] = text
        return pdf_dict

#Excel Reader Class
class xlsxReader:

    def __init__(self, file_path: str) -> str:
        self.file_path = file_path

    def xlsx_text(self):
      """A function which returns a string of an
         excel document.

          Returns:
          text(str): String of text of a document.
      """
      inputExcelFile = self.file_path
      text = str()
      wb = openpyxl.load_workbook(inputExcelFile)
      for sn in wb.sheetnames:
        excelFile = pd.read_excel(inputExcelFile, engine = 'openpyxl', sheet_name = sn)
        excelFile.to_csv("ResultCsvFile.csv", index = None, header=True)

        with open("ResultCsvFile.csv", "r") as csvFile:
          lines = csvFile.read().split(",") # "rn" if needed
          for val in lines:
            if val != '':
              text += val + ' '
          text = text.replace('ufeff', '')
          text = text.replace('n', ' ')
      return text

#CSV Reader Class
class csvReader:

    def __init__(self, file_path: str) -> str:
        self.file_path = file_path

    def csv_text(self):
      """A function which returns a string of an
         csv document.

          Returns:
          text(str): String of text of a document.
      """
      text = str()
      with open(self.file_path, "r") as csvFile:
        lines = csvFile.read().split(",") # split on "\r\n" if needed
        for val in lines:
          text += val + ' '
        text = text.replace('\ufeff', '')
        text = text.replace('\n', ' ')
      return text

#Powerpoint Reader Class
class pptReader:

    def __init__(self, file_path: str) -> str:
        self.file_path = file_path

    def ppt_text(self):
      """A function which returns a string of an
        Mirocsoft PowerPoint document.

        Returns:
        text(str): String of text of a document.
    """
      prs = Presentation(self.file_path)
      text = str()
      for slide in prs.slides:
        for shape in slide.shapes:
          if not shape.has_text_frame:
              continue
          for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
              text += ' ' + run.text

      return text

#Word Document Reader Class
class wordDocReader:
  def __init__(self, file_path: str) -> str:
    self.file_path = file_path

  def word_reader(self):
    """A function which returns a string of an
          Microsoft Word document.

          Returns:
          text(str): String of text of a document.
      """
    text = docx2txt.process(self.file_path)
    text = text.replace('\n', ' ')
    text = text.replace('\xa0', ' ')
    text = text.replace('\t', ' ')
    return text

#Data Processing and Recommendation Functions
class dataprocessor:
  def __init__(self):
    return

  @staticmethod
  def get_wordnet_pos(text: str) -> str:
    """Map POS tag to first character lemmatize() accepts
    Inputs:
    text(str): A string of text

    Returns:
    tag_dict(dict): A dictionary of tags
    """
    tag = nltk.pos_tag([text])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

  @staticmethod
  def preprocess(text: str):
    """A function that prepoccesses text through the
    steps of Natural Language Processing (NLP).
      Inputs:
      text(str): A string of text

      Returns:
      text(str): A processed string of text
      """
    #lowercase
    text = text.lower()

    #punctuation removal
    text = "".join([i for i in text if i not in string.punctuation])

    #Digit removal (only removes tokens that are entirely numeric)
    text = [x for x in text.split(' ') if x.isnumeric() == False]

    #Stopword removal
    stopwords = nltk.corpus.stopwords.words('english')
    custom_stopwords = ['\n', '\n\n', '&', ' ', '.', '-', '$', '@']
    stopwords.extend(custom_stopwords)

    text = [i for i in text if i not in stopwords]
    text = ' '.join(word for word in text)

    #lemmatization
    lm = WordNetLemmatizer()
    text = [lm.lemmatize(word, dataprocessor.get_wordnet_pos(word)) for word in text.split(' ')]
    text = ' '.join(word for word in text)

    text = re.sub(' +', ' ',text)

    return text

  @staticmethod
  def data_reader(list_file_names, file_dict = dict()):
    """A function that reads in the data from a directory of files.

    Inputs:
    list_file_names(list): List of the filepaths in a directory.

    Returns:
    text_list (list): A list where each value is a string of the text
    for each file in the directory
    file_dict(dict): Dictionary where the keys are the filename and the values
    are the information found within each given file
    """

    text_list = []
    reader = dataprocessor()
    for file in list_file_names:
      temp = file.split('.')
      filetype = temp[-1]
      if filetype == "pdf":
        file_pdf = pdfReader(file)
        text = file_pdf.PDF_one_pager()

      elif filetype == "docx":
        word_doc_reader = wordDocReader(file)
        text = word_doc_reader.word_reader()

      elif filetype == "pptx" or filetype == 'ppt':
        ppt_reader = pptReader(file)
        text = ppt_reader.ppt_text()

      elif filetype == "csv":
        csv_reader = csvReader(file)
        text = csv_reader.csv_text()

      elif filetype == 'xlsx':
        xl_reader = xlsxReader(file)
        text = xl_reader.xlsx_text()
      else:
        print('File type {} not supported!'.format(filetype))
        continue

      text = reader.preprocess(text)
      text_list.append(text)

    #Map each index to its (full path, file name) pair
    for i, file in enumerate(list_file_names):
      file_dict[i] = (file, file.split('/')[-1])
    return text_list, file_dict

  @staticmethod
  def database_processor(file_dict,text_list: list):
    """A function that transforms the text of each file within the
    database into a vector.

    Inputs:
    file_dict(dict): Dictionary where the keys are the filename and the values
    are the information found within each given file
    text_list (list): A list where each value is a string of the text
    for each file in the directory

    Returns:
    list_dense(list): A list of the files' text turned into vectors.
    vectorizer: The vectorizer used to transform the strings of text
    file_vector_dict(dict): A dictionary where the file names are the keys
    and the vectors of each file's text are the values.
    """
    file_vector_dict = dict()
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(text_list)
    feature_names = vectorizer.get_feature_names_out()
    matrix = vectors.todense()
    list_dense = matrix.tolist()
    for i in range(len(list_dense)):
      file_vector_dict[file_dict[i][1]] = list_dense[i]

    return list_dense, vectorizer, file_vector_dict

  @staticmethod
  def input_processor(text, vectorizer):
      """A function that accepts a string of text and vectorizes it using a
      TFIDF vectorizer.

      Inputs:
      text (str): A string of text
      vectorizer: A pre-trained TFIDF vectorizer

      Returns:
      text_vector (sparse matrix): The vectorized form of the input text
      """
      # If the input is mistakenly passed as a list, join it into a single string
      if isinstance(text, list):
          text = ' '.join(text)

      # Convert the text to its TF-IDF vectorized form
      text_vector = vectorizer.transform([text])  # Note that vectorizer expects a list of strings
      return text_vector

  @staticmethod
  def similarity_checker(vector_1, vector_2):
    """A function accepts two vectors and computes their cosine similarity.

    Inputs:
    vector_1: A numerical vector
    vector_2: A numerical vector

    Returns:
    Cosine similarity score between the two vectors
    """
    vector_1 = np.asarray(vector_1)
    vector_2 = np.asarray(vector_2)
    #Ensure both vectors are 2D before computing cosine similarity
    if vector_1.ndim == 1:
      vector_1 = np.expand_dims(vector_1, axis=0)
    if vector_2.ndim == 1:
      vector_2 = np.expand_dims(vector_2, axis=0)
    return cosine_similarity(vector_1, vector_2)

  @staticmethod
  def recommender(vector_file_list, query_vector, file_dict, lda_model, lda_vectorizer, topic_distributions):
      """Recommender method accepts vectors instead of raw text."""

      similarity_list = []
      topic_similarity_list = []
      score_dict = dict()

      # Get topic distribution for the query
      query_topic_dist = dataprocessor.get_topic_distribution(query_vector, lda_model, lda_vectorizer)

      for i, file_vector in enumerate(vector_file_list):
          # Compute cosine similarity between vector representations
          cosine_sim = dataprocessor.similarity_checker(file_vector, query_vector.toarray())  # Use toarray() to convert csr_matrix to array
          score_dict[file_dict[i][1]] = cosine_sim[0][0]

          # Compute topic similarity
          doc_topic_dist = topic_distributions[i].reshape(1, -1)
          topic_sim = dataprocessor.topic_similarity(query_topic_dist, doc_topic_dist)
          topic_similarity_list.append(topic_sim)

          # Combine cosine similarity and topic similarity
          combined_score = 0.7 * cosine_sim[0][0] + 0.3 * topic_sim
          similarity_list.append(combined_score)

      # Sort by cosine similarity and keep the top half of the documents
      recommended = sorted(score_dict.items(), key=lambda x: -x[1])[:int(np.round(.5*len(similarity_list)))]

      final_recommendation = [rec[0] for rec in recommended]
      return final_recommendation, similarity_list[:len(final_recommendation)], topic_similarity_list[:len(final_recommendation)]

  @staticmethod
  def ranker(recommendation_val, file_vec_dict):
    """A function accepts a list of recommendaton values and a dictionary
    files wihin the databse and their vectors.

    Inputs:
    reccomendation_val(list): A list of recommendations found through cosine
    similarity
    file_vec_dic(dict): A dictionary of the filenames as keys and their
    text in vectors as the values.

    Returns:
    ec_recommended(list): A list of the top 20% recommendations found using the
    eigenvector centrality algorithm.
    """
    my_graph = nx.Graph()
    for i in range(len(recommendation_val)):
      file_1 = recommendation_val[i]
      for j in range(len(recommendation_val)):
        file_2 = recommendation_val[j]

        if i != j:
          #Calculate sim_score between two values (weight)
          edge_dist = cosine_similarity([file_vec_dict[recommendation_val[i]]],[file_vec_dict[recommendation_val[j]]])
          #add an edge from file 1 to file 2 with the weight
          # Check if edge_dist is greater than a threshold (e.g., 0.1)
          # This prevents adding edges with very low similarity

          if edge_dist[0][0] > 0.1: #Added threshold to prevent zero values
            my_graph.add_edge(file_1, file_2, weight=edge_dist[0][0])

    # Check if the graph is empty before calculating eigenvector centrality
    if my_graph.number_of_nodes() == 0:
        # Handle the case where the graph is empty
        # This could involve returning an empty list or a default ranking
        print("Warning: Empty graph. Returning original recommendations.")
        return [(rec, 1) for rec in recommendation_val] # Return original recommendations with default rank
    else:
      #Rank the nodes using eigenvector centrality
      rec = nx.eigenvector_centrality(my_graph, max_iter = 1000) #Added max iterations in case of convergence issues
      #Sort all nodes by their centrality score
      ec_recommended = sorted(rec.items(), key=lambda x:-x[1])[:int(np.round(len(rec)))]

      return ec_recommended

  @staticmethod
  def weighted_final_rank(sim_list, ec_recommended, final_recommendation, topic_sim_list):
      """A function that incorporates topic similarity into the final ranking."""
      final_dict = dict()

      #Iterate over the centrality-ranked files and blend the three metrics
      for i in range(len(ec_recommended)):
          combined_value = ((0.6 * sim_list[final_recommendation.index(ec_recommended[i][0])].squeeze()) +
                            (0.2 * ec_recommended[i][1]) +
                            (0.2 * topic_sim_list[final_recommendation.index(ec_recommended[i][0])]))
          final_dict[ec_recommended[i][0]] = combined_value

      weighted_final_recommend = sorted(final_dict.items(), key=lambda x: -x[1])[:int(np.round(len(final_dict)))]

      return weighted_final_recommend

  @staticmethod
  def fit_topic_model(documents, n_topics=10):
      """Fit a topic model (LDA) on the documents and return the model and topic distributions."""
      vectorizer = TfidfVectorizer(stop_words='english')
      tfidf_matrix = vectorizer.fit_transform(documents)

      lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
      topic_distributions = lda_model.fit_transform(tfidf_matrix)

      return lda_model, vectorizer, topic_distributions

  @staticmethod
  def get_topic_distribution(text, lda_model, vectorizer):
      """Get the topic distribution for a given text using the LDA model."""
      # Check if text is a csr_matrix and convert to dense if necessary
      if isinstance(text, csr_matrix):
          text = text.toarray()

      # Assuming text is now a string or dense array, can be converted to string
      text = str(text)

      text_vector = vectorizer.transform([text]) #Applying transform to str(text)
      topic_distribution = lda_model.transform(text_vector)

      return topic_distribution


  @staticmethod
  def topic_similarity(query_topic_dist, document_topic_dist):
      """Compute similarity between query topic distribution and document topic distribution."""
      return cosine_similarity(query_topic_dist, document_topic_dist)[0][0]

#Final Code To Get Recommendations
# Define the path to the database
path = 'your path here'

db = [f for f in glob.glob(path + '*')]

# Read documents and create the file dictionary
research_documents, file_dictionary = dataprocessor.data_reader(db)

# Preprocess and vectorize the documents
list_files, vectorizer, file_vec_dict = dataprocessor.database_processor(file_dictionary, research_documents)

# Fit the topic model (LDA)
lda_model, lda_vectorizer, topic_distributions = dataprocessor.fit_topic_model(research_documents)

# Define the query
query = 'generative adversarial network images'

# Preprocess the query
query = dataprocessor.preprocess(query)

# Convert query to vector
query_vector = dataprocessor.input_processor(query, vectorizer)

# Get recommendations using both cosine similarity and topic similarity
recommendation, sim_list, topic_sim_list = dataprocessor.recommender(list_files, query_vector, file_dictionary, lda_model, lda_vectorizer, topic_distributions)

# Rank the recommendations using eigenvector centrality
ec_recommendation = dataprocessor.ranker(recommendation, file_vec_dict)

# Get the final weighted recommendation, incorporating topic similarity
final_weighted_recommended = dataprocessor.weighted_final_rank(sim_list, ec_recommendation, recommendation, topic_sim_list)

# Print the final recommendations
print(final_weighted_recommended)

query = "Generative Adversarial Networks"

Example Output
[('Using Generative Adversarial Networks to Augment Unmanned Aerial.pdf', 0.5214213562373095), (
'Published Research - Extracting Insight from Small Corpora.pdf', 0.27088670405465454)]

Topics?

The following code is useful if you want to know the topics of the documents in your database as well as the topics of your query.

#Topic Check
def print_document_topics(file_dict, topic_distributions, lda_model, vectorizer, n_top_words=10):
    """Prints the topics for each document along with its filename."""

    # Get the feature names from the vectorizer (i.e., the words in the vocabulary)
    feature_names = vectorizer.get_feature_names_out()

    # Print topics for each document
    for i, doc_topic_dist in enumerate(topic_distributions):
        file_name = file_dict[i][1]
        print(f"nFile: {file_name}")
        print(f"Topic distribution: {doc_topic_dist}")

        # For each document, print the most prominent topics and their words
        sorted_topics = doc_topic_dist.argsort()[::-1]  # Sort topics in descending order
        for topic_idx in sorted_topics[:3]:  # Show the top 3 topics
            topic_words = [feature_names[i] for i in lda_model.components_[topic_idx].argsort()[:-n_top_words - 1:-1]]
            print(f"  Topic {topic_idx}: {' '.join(topic_words)}")
def print_query_topics(query, lda_model, vectorizer, n_top_words=10):
    """Prints the topics for the query."""

    # Vectorize and get the topic distribution for the query
    query_vector = vectorizer.transform([query])
    query_topic_dist = lda_model.transform(query_vector)[0]

    # Get the feature names from the vectorizer
    feature_names = vectorizer.get_feature_names_out()

    print(f"nQuery: {query}")
    print(f"Topic distribution: {query_topic_dist}")

    # Print the most prominent topics for the query
    sorted_topics = query_topic_dist.argsort()[::-1]  # Sort topics in descending order
    for topic_idx in sorted_topics[:3]:  # Show the top 3 topics
        topic_words = [feature_names[i] for i in lda_model.components_[topic_idx].argsort()[:-n_top_words - 1:-1]]
        print(f"  Topic {topic_idx}: {' '.join(topic_words)}")
# Assuming topic_distributions, lda_model, vectorizer, and file_dict have already been defined

# Print topics for each document in the database
print_document_topics(file_dictionary, topic_distributions, lda_model, lda_vectorizer)

# Define your query
query = 'generative adversarial network images'
query = dataprocessor.preprocess(query)

# Print topics for the query
print_query_topics(query, lda_model, lda_vectorizer)

Example Output

File: 3. McCloskey, Cox, Champagne & Bihl - Benefits of using blended generative adversarial network images to augment classification model training data sets - Copy.pdf
Topic distribution: [0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]
  Topic 9: collection need machine look pattern make follow point hyperparameter level
  Topic 8: collection need machine look pattern make follow point hyperparameter level
  Topic 7: collection need machine look pattern make follow point hyperparameter level

File: Published Research - Extracting Insight from Small Corpora.pdf
Topic distribution: [0.00522037 0.00522037 0.00522404 0.00522037 0.95301298 0.00522037
 0.00522037 0.00522037 0.00522037 0.00522037]
  Topic 4: review word topic chatgpt corpus research business model customer data
  Topic 2: image model dataset training combine object truck car gan set
  Topic 9: collection need machine look pattern make follow point hyperparameter level

File: MS_THESIS - Using Generative Adversarial Networks to Augment Unmanned Aerial.pdf
Topic distribution: [0.00619424 0.00619424 0.94424855 0.00619424 0.00619753 0.00619424
 0.00619424 0.00619424 0.00619424 0.00619424]
  Topic 2: image model dataset training combine object truck car gan set
  Topic 4: review word topic chatgpt corpus research business model customer data
  Topic 9: collection need machine look pattern make follow point hyperparameter level

Query: generative adversarial network image
Topic distribution: [0.03350991 0.03350991 0.69841058 0.03350991 0.03351012 0.03350991
 0.03350991 0.03350991 0.03350991 0.03350991]
  Topic 2: image model dataset training combine object truck car gan set
  Topic 4: review word topic chatgpt corpus research business model customer data
  Topic 9: collection need machine look pattern make follow point hyperparameter level
