arXiv Keyword Extraction and Analysis Pipeline with KeyBERT and Taipy

Author:Murphy  |  View: 23965  |  Time: 2025-03-23 18:52:58
Photo by Marylou Fortier on Unsplash

As the amount of textual data from sources like social media, customer reviews, and online platforms grows exponentially, we must be able to make sense of this unstructured data.

Keyword extraction and analysis are powerful natural language processing (NLP) techniques that enable us to achieve that.

Keyword extraction involves automatically identifying and extracting the most relevant words from a given text, while keyword analysis involves analyzing the keywords to gain insights into the underlying patterns.

In this step-by-step guide, we explore building a keyword extraction and analysis pipeline and web app on arXiv abstracts using the powerful tools of KeyBERT and Taipy.

Contents

  • (1) Context
  • (2) Tools Overview
  • (3) Step-by-Step Guide
  • (4) Wrapping it up

Here is the accompanying GitHub repo for this article.


(1) Context

Given the rapid progress in artificial intelligence (AI) and machine learning research, keeping track of the many papers published daily can be challenging.

When it comes to such research, arXiv is one of the leading sources of information. arXiv (pronounced 'archive') is an open-access archive hosting a vast collection of scientific papers covering various disciplines like computer science, mathematics, and more.

arXiv screenshot | Image used under CC 2.0 license

One of the key features of arXiv is that it provides abstracts for each paper uploaded to its platform. These abstracts are an ideal data source as they are concise, rich in technical vocabulary, and contain domain-specific terminology.

Hence, we will utilize the latest batches of arXiv abstracts as the text data to work on in this project.

The goal is to create a web application (comprising a frontend interface and backend pipeline) where users can view the keywords and key phrases of arXiv abstracts based on specific input values.

Screenshot of the completed application user interface | Image by author

(2) Tools Overview

There are three main tools that we will use in this project:

  • arXiv API Python wrapper
  • KeyBERT
  • Taipy

(i) arXiv API Python wrapper

The arXiv website offers public API access to maximize its openness and interoperability. To retrieve the text abstracts as part of our Python workflow, we can use the Python wrapper for the arXiv API.

The arXiv API Python wrapper provides a set of functions for searching the database for papers that match specific criteria, such as author, keyword, category, and more.

It also lets users retrieve detailed metadata about each paper, such as the title, abstract, authors, and publication date.


(ii) KeyBERT

KeyBERT (from the terms 'keyword' and 'BERT') is a Python library that provides an easy-to-use interface for using BERT embeddings and cosine similarity to extract the words and phrases in a document that are most representative of the document itself.

Illustration of how KeyBERT works | Image used under MIT License

The biggest strength of KeyBERT is its flexibility. It allows users to easily modify the underlying settings (e.g., parameters, embeddings, tokenization) to experiment and fine-tune the keywords obtained.

In this project, we will be tuning the following set of parameters:

  • Number of the top keywords to be returned
  • Word n-gram range (i.e., minimum and maximum n-gram length)
  • Diversification algorithm (Max Sum Distance or Maximal Marginal Relevance), which controls how diverse the extracted keywords are relative to one another
  • Number of candidates (if Max Sum Distance is set)
  • Diversity value (if Maximal Marginal Relevance is set)

Both diversification algorithms (Max Sum Distance and Maximal Marginal Relevance) share the same basic idea of balancing two objectives: retrieving keywords that are highly relevant to the document while keeping them diverse enough to avoid redundancy.
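To make the relevance-versus-diversity trade-off concrete, here is a toy sketch of Maximal Marginal Relevance in plain Python. It is an illustration of the idea only, not KeyBERT's own code: the function names and the tiny two-dimensional vectors are made up for this example, whereas real embeddings have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, cand_vecs, top_n=2, diversity=0.5):
    """Return the indices of the candidates picked by Maximal Marginal Relevance."""
    relevance = [cosine(doc_vec, c) for c in cand_vecs]
    # Start with the single most relevant candidate.
    selected = [max(range(len(cand_vecs)), key=relevance.__getitem__)]
    remaining = [i for i in range(len(cand_vecs)) if i not in selected]

    while remaining and len(selected) < top_n:

        def score(i):
            # Penalise similarity to anything that has already been chosen.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: candidate 1 is almost a duplicate of candidate 0,
# candidate 2 is less relevant but clearly different.
doc = [1.0, 1.0]
candidates = [[1.0, 1.0], [1.0, 0.9], [1.0, 0.0]]
low_diversity = mmr(doc, candidates, top_n=2, diversity=0.2)
high_diversity = mmr(doc, candidates, top_n=2, diversity=0.7)
```

With a low diversity value the near-duplicate candidate is kept because it is highly relevant; with a high value it is dropped in favour of the more distinct one.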


(iii) Taipy

Taipy is an open-source Python application builder that lets developers and data scientists quickly turn data and machine learning algorithms into complete web applications.

While designed to be a low-code library, Taipy also provides a high level of user customization. Therefore, it is well-suited for wide-ranging use cases, from simple dashboarding to production-ready industrial applications.

Taipy components | Image by author

There are two key components of Taipy: Taipy GUI and Taipy Core.

  • Taipy GUI: A simple graphical user interface builder enabling us to easily create an interactive frontend app interface.
  • Taipy Core: A modern backend framework that lets us efficiently build and execute pipelines and scenarios.

While we can use Taipy GUI or Taipy Core independently, combining both allows us to build powerful applications efficiently.


(3) Step-by-Step Guide

As mentioned earlier in the Context section, we will build a web app that extracts and analyzes keywords of selected arXiv abstracts.

The following diagram illustrates how the data and tools are integrated.

Overview of project | Image by author

Let us get started with the steps to create the above pipeline and web application in Python.

Step 1 – Initial Setup

We start by pip installing the necessary Python libraries with corresponding versions shown below:


Step 2 – Setup Configuration File

As numerous parameters will be used, saving them inside a separate configuration file is ideal. The following YAML file config.yml contains the initial set of configuration parameter values.

<script src="https://gist.github.com/kennethleungty/f77f5a86e7f08c1438700d04d3b58548.js"></script>

With the configuration file set up, we can then easily import these parameter values into our other Python scripts with the following code:

import yaml

with open('config.yml') as f:
    cfg = yaml.safe_load(f)

Step 3 – Build Functions

In this step, we will create a series of Python functions that form vital components of the pipeline. We create a new Python file functions.py to store these functions.

(3.1) Retrieve and Save arXiv Abstracts and Metadata

The first function to add into functions.py is one for retrieving text abstracts from the arXiv database using the arXiv API Python wrapper.

<script src="https://gist.github.com/kennethleungty/d4397e42a4e22bace76fe38c30c96229.js"></script>
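For readers who cannot load the embedded gist, a minimal sketch of such a retrieval function might look like the following. It is not the author's code: the function names and default values are assumptions, and it relies on the `arxiv` package (`pip install arxiv`) together with the arXiv query prefixes `cat:` and `abs:`.

```python
def build_query(category, phrase=None):
    """Compose an arXiv API query string from a category and an optional phrase."""
    query = f"cat:{category}"
    if phrase:
        # abs: restricts the phrase match to the abstract field.
        query += f' AND abs:"{phrase}"'
    return query


def fetch_results(query, max_results=50):
    """Return the most recently submitted papers that match the query."""
    # Imported here so build_query stays usable without the package installed.
    import arxiv

    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending,
    )
    return list(arxiv.Client().results(search))
```

Calling `fetch_results(build_query("cs.CL"))` would then return result objects exposing fields such as `title`, `summary` (the abstract), `authors`, and `published`.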

Next, we write a function to store the abstract texts and corresponding metadata in a pandas DataFrame.

<script src="https://gist.github.com/kennethleungty/9f6e1fee8a3d0699758946191c7ae420.js"></script>
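As a rough sketch of this step (again, not the author's gist), each result can be flattened into a dictionary and the list of dictionaries handed to pandas. The column names below are assumptions chosen for illustration; the attribute names are those exposed by the `arxiv` wrapper's result objects.

```python
def result_to_record(result):
    """Flatten one arXiv result into a plain dict (one future DataFrame row)."""
    return {
        "title": result.title,
        # Abstracts come back with hard line breaks; join them into one line.
        "abstract": result.summary.replace("\n", " "),
        "authors": ", ".join(author.name for author in result.authors),
        "published": result.published,
        "url": result.entry_id,
    }


def results_to_dataframe(results):
    """Build a DataFrame with one row per paper."""
    # Imported here so result_to_record works without pandas installed.
    import pandas as pd

    return pd.DataFrame([result_to_record(r) for r in results])
```

Keeping the flattening logic in its own small function makes it easy to check against a stand-in object before any network call is made.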

(3.2) Process Data

For the data processing step, we have the following function to parse the abstract publication date into the appropriate format while creating new empty columns to store keywords.

<script src="https://gist.github.com/kennethleungty/1bcc6594cdd42d833d72b672e782b817.js"></script>
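A hedged sketch of what this processing step could look like is shown below. The column names (`published`, `keyword_1`, `keyword_2`, ...) and the date format are assumptions made for this example and may differ from the author's gist.

```python
def keyword_columns(top_n):
    """Names of the empty columns that will later hold the extracted keywords."""
    return [f"keyword_{i}" for i in range(1, top_n + 1)]


def process_data(df, top_n=5, date_col="published"):
    """Format the publication date and add empty keyword columns."""
    # Imported here so keyword_columns stays usable without pandas installed.
    import pandas as pd

    out = df.copy()
    # Keep only the calendar date, e.g. 2023-05-01.
    out[date_col] = pd.to_datetime(out[date_col]).dt.strftime("%Y-%m-%d")
    for col in keyword_columns(top_n):
        out[col] = ""
    return out
```

Working on a copy leaves the raw DataFrame untouched, which suits a pipeline where the raw and processed data are stored separately.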

(3.3) Run KeyBERT

We next create a function to run the KeyBERT class from the KeyBERT library. The KeyBERT class offers a minimal method for keyword extraction with BERT and is the easiest way for us to get started.

There are many different methods for generating the BERT embeddings (e.g., Flair, Huggingface Transformers, and spaCy). In this case, we will use sentence-transformers as recommended by the KeyBERT creator.

In particular, we will use the default [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model as it provides a good balance of speed and quality.

The following function extracts the keywords from each abstract iteratively and saves them in the new DataFrame columns created in the previous step.

<script src="https://gist.github.com/kennethleungty/f4f6967871270d5a128f50fba4111c4f.js"></script>
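The sketch below shows one way the tunable parameters listed earlier could be mapped onto KeyBERT's `extract_keywords()` arguments. The helper names and default values are assumptions for illustration; the keyword arguments themselves (`keyphrase_ngram_range`, `top_n`, `use_maxsum`, `nr_candidates`, `use_mmr`, `diversity`) belong to the `keybert` package (`pip install keybert`).

```python
def keybert_kwargs(top_n=5, ngram_min=1, ngram_max=2, algo="mmr",
                   nr_candidates=20, diversity=0.5):
    """Translate the tunable parameters into extract_keywords() arguments."""
    kwargs = {
        "keyphrase_ngram_range": (ngram_min, ngram_max),
        "stop_words": "english",
        "top_n": top_n,
    }
    if algo == "maxsum":
        # Max Sum Distance needs the size of the candidate pool.
        kwargs.update(use_maxsum=True, nr_candidates=nr_candidates)
    elif algo == "mmr":
        # Maximal Marginal Relevance needs a diversity value between 0 and 1.
        kwargs.update(use_mmr=True, diversity=diversity)
    return kwargs


def extract_keywords(abstracts, **params):
    """Return, for each abstract, a list of (keyword, score) pairs."""
    # Imported here so keybert_kwargs stays usable without the package installed.
    from keybert import KeyBERT

    model = KeyBERT(model="all-MiniLM-L6-v2")
    kwargs = keybert_kwargs(**params)
    return [model.extract_keywords(text, **kwargs) for text in abstracts]
```

Only the argument that matches the chosen algorithm is passed, which mirrors the "if Max Sum Distance is set" and "if Maximal Marginal Relevance is set" conditions in the parameter list.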

(3.4) Get Keywords Value Counts

Finally, we create a function that generates a value count of the keywords so that we can plot the keyword frequencies in a chart later.

<script src="https://gist.github.com/kennethleungty/6b2e8a6ac374e21bb69058e9adb26787.js"></script>
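The counting logic itself is simple. As a minimal sketch, assuming the keywords are available as one list of strings per abstract (the author's version works on DataFrame columns instead, where pandas' `value_counts()` plays the same role), it can be written with the standard library alone:

```python
from collections import Counter


def keyword_counts(keyword_lists, top=10):
    """Count how often each keyword occurs across all abstracts."""
    # Lower-case so that 'BERT' and 'bert' are counted together.
    counts = Counter(kw.lower() for kws in keyword_lists for kw in kws)
    return counts.most_common(top)


example = keyword_counts(
    [["BERT", "keyword extraction"], ["bert", "graph"]],
    top=1,
)
```

The resulting list of `(keyword, count)` pairs is already sorted by frequency, which is the shape a bar chart needs.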

Step 4 – Setup Taipy Core: Backend Config

To orchestrate and link the backend pipeline flow, we will leverage the capabilities of Taipy Core.

Taipy Core offers an open-source framework to create, manage, and execute our data pipelines easily and efficiently. It has four fundamental concepts: Data Nodes, Tasks, Pipelines, and Scenarios.

Four fundamental concepts in Taipy Core | Image by author

To set up the backend, we will use configuration objects (from the Config class) to model and define the characteristics and desired behavior of the abovementioned concepts.


(4.1) Data Nodes

As with most Data Science projects, we start by handling the data. In Taipy Core, we use Data Nodes to define the data we will work with.

We can think of Data Nodes as Taipy's representation of data variables. However, instead of storing the data directly, Data Nodes contain a set of instructions on how to retrieve the data needed.

Data Nodes can read and write a wide range of data types, such as Python objects (e.g., str, int, list, dict, DataFrame, etc.), Pickle files, CSVs, SQL databases, and more.

Using the Config.configure_data_node() function, we define the Data Nodes for the keyword parameters based on the values from the configuration file in Step 2.

<script src="https://gist.github.com/kennethleungty/c418a2e7e55bbc813384e9c30ff113c9.js"></script>

The id parameter sets the name of the Data Node, while the default_data parameter defines the default values.

We next include the configuration objects for the five sets of data along the pipeline, as illustrated below:

Illustration of five Data Nodes along pipeline | Image by author

The following code defines the five configuration objects:

<script src="https://gist.github.com/kennethleungty/b8b61c9c743ed7baa523b8b405806ffe.js"></script>

(4.2) Tasks

Tasks in Taipy can be thought of as Python functions. We can define the configuration object for a Task using the Config.configure_task() function.

We need to set five Task configuration objects corresponding to the five functions built in Step 3.

Illustration of the five Tasks | Image by author
<script src="https://gist.github.com/kennethleungty/da403910216a4a87c492c1961f4618cb.js"></script>

The input and output parameters refer to the input and output Data Nodes, respectively.

For example, in task_process_data_cfg, the input is the Data Node for the raw pandas DataFrame containing the arXiv search results, while the output is the Data Node for the DataFrame storing processed data.

The skippable parameter, when set to True, indicates that the Task can be skipped if no changes have been made to the inputs.
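To show how Data Node and Task configurations fit together, here is a hedged sketch for a single Task. The Data Node ids (`raw_df`, `processed_df`) and the Task id are assumptions; `configure_data_node()` and `configure_task()` are the Taipy `Config` methods discussed above. The `config` argument is expected to be `taipy.Config`, passed in so that the sketch can be read and imported without Taipy installed.

```python
def configure_process_task(config, process_fn):
    """Wire one Task between two Data Nodes and return its configuration."""
    # Input: the raw DataFrame holding the arXiv search results.
    raw_cfg = config.configure_data_node(id="raw_df")
    # Output: the DataFrame after dates are parsed and keyword columns added.
    processed_cfg = config.configure_data_node(id="processed_df")

    return config.configure_task(
        id="task_process_data",
        function=process_fn,
        input=raw_cfg,
        output=processed_cfg,
        skippable=True,  # skip the Task when its inputs have not changed
    )
```

A call such as `configure_process_task(Config, process_data)` after `from taipy import Config` would produce a configuration object comparable to task_process_data_cfg.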

Here is the flowchart of the Data Nodes and Tasks we have defined so far:

Data Nodes and Tasks flowchart | Image by author

(4.3) Pipelines

A Pipeline is a series of Tasks that will be executed automatically by Taipy. It is a configuration object comprising a sequence of Task configuration objects.

In this case, we will allocate the five Tasks into two Pipelines (one for data preparation and one for keyword analysis) as illustrated below:

Tasks within the two pipelines | Image by author

We use the following code to define our two Pipeline configs:

<script src="https://gist.github.com/kennethleungty/417dc5740293fcd68e7dc3935926bac6.js"></script>

As with all configuration objects, we assign a name to these Pipeline configurations using the id parameter.


(4.4) Scenarios

In this project, we aim to create an application that reflects the updated set of keywords (and corresponding analysis) based on changes made to input parameters (e.g., n-gram length).

For that to happen, we leverage the powerful concept of Scenarios. Taipy Scenarios provide the framework for running Pipelines under different conditions, such as when the user modifies the input parameters or data.

Scenarios also allow us to save the outputs from the different inputs for easy comparison within the same app interface.

Since we expect to do a straightforward sequential run of the Pipelines, we can place both Pipeline configs into a single Scenario configuration object.

<script src="https://gist.github.com/kennethleungty/2e07f55e83524d3914aa885d5704ee45.js"></script>
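As a rough sketch of how the Pipeline and Scenario configurations stack on top of the Task configurations, consider the helper below. It assumes a Taipy 2.x release, where `Config.configure_pipeline()` and the `pipeline_configs` argument of `Config.configure_scenario()` are available (later releases changed this part of the API). The ids are made up for illustration, and `config` is expected to be `taipy.Config`.

```python
def configure_flow(config, prep_task_cfgs, analysis_task_cfgs):
    """Group Task configs into two Pipelines and place both in one Scenario."""
    prep_cfg = config.configure_pipeline(
        id="data_preparation",
        task_configs=prep_task_cfgs,
    )
    analysis_cfg = config.configure_pipeline(
        id="keyword_analysis",
        task_configs=analysis_task_cfgs,
    )
    # The Pipelines run in the order given here.
    return config.configure_scenario(
        id="scenario",
        pipeline_configs=[prep_cfg, analysis_cfg],
    )
```

Which Tasks belong to which Pipeline is decided by the two lists passed in, so the same helper works for any split of the five Tasks.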

Step 5 – Setup Taipy GUI (Frontend)

Let us now switch gears and explore the frontend aspects of our application. Taipy GUI provides Python classes that make it easy to create powerful web app interfaces with text and graphical elements.

Pages are the basis for the user interface, and they hold text, images, or controls that display information in the application through visual elements.

There are two pages to create: (i) a keyword analysis dashboard page and (ii) a data viewer page to display the keywords DataFrame.

(5.1) Data Viewer

Taipy GUI can be considered an augmented Markdown, meaning we can use the Markdown syntax to build our frontend interface.

We start with the simple frontend page displaying the DataFrame of the extracted arXiv abstract data. The page is set up in a Python script (named data_viewer_md.py), with the Markdown stored in a variable (called data_page).

<script src="https://gist.github.com/kennethleungty/837019d99a8dfc726688a528a9bb4fca.js"></script>

The basic syntax for creating Taipy constructs in Markdown is using text fragments in the generic format of <|...|...|>.

In the above Markdown, we pass our DataFrame object df along with table, which indicates a table element. With just these few lines of code, we get an output like the following:

Screenshot of the Data Viewer page | Image by author
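For reference, a page of this kind can be as short as the sketch below. The heading text and the `page_size` value are assumptions; `<|{df}|table|>` is the Taipy construct described above.

```python
# A regular (non f-) string, so {df} is kept literally and resolved
# by Taipy GUI at render time against the variable named df.
data_page = """
# Data Viewer

<|{df}|table|page_size=10|>
"""
```

Because the binding happens by name, a variable called `df` must exist in the script that launches the GUI.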

(5.2) Keyword Analysis Dashboard

We now move to the main dashboard page of the application, where we can make changes to the parameters and visualize the keywords obtained. The visual elements will be contained within a Python script (named analysis_md.py).

This page has numerous components, so let's take it one step at a time. First, we instantiate the parameter values upon the loading of the application.

<script src="https://gist.github.com/kennethleungty/c50a52baa474d7232a2d63992709520f.js"></script>

Next, we define the input segment of the page where users can make changes to parameters and scenarios. This segment will be saved in a variable called input_page, and will eventually look like this:

Input segment of the Keyword Analysis page | Image by author

We create a seven-column layout in the Markdown so that the input fields (e.g., text input, number input, dropdown menu selector) and buttons can be organized neatly.

<script src="https://gist.github.com/kennethleungty/072a38bf7aa59823020fdaca34e209d3.js"></script>
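A hedged sketch of such a seven-column input segment is given below. The variable names, labels, and callback names (`on_param_change`, `submit_scenario`) are placeholders rather than the author's; the `layout`, `input`, `number`, `selector`, and `button` elements are standard Taipy GUI constructs.

```python
# One control per column; each string is a Taipy visual element.
controls = [
    "<|{query}|input|label=Query|>",
    "<|{ngram_min}|number|label=Min n-gram|>",
    "<|{ngram_max}|number|label=Max n-gram|>",
    "<|{top_n}|number|label=Top N|>",
    "<|{algo}|selector|lov={algo_options}|dropdown|label=Diversity algorithm|>",
    "<|{diversity}|number|label=Diversity|on_change=on_param_change|>",
    "<|Submit|button|on_action=submit_scenario|>",
]

# Equal relative widths, one per control: "1 1 1 1 1 1 1".
columns = " ".join("1" for _ in controls)

# Elements separated by blank lines each occupy their own column.
input_page = f"<|layout|columns={columns}|\n" + "\n\n".join(controls) + "\n|>"
```

Building the segment from a list keeps the column count and the number of controls in step if a field is added or removed.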

We will explain the callback functions in the on_change and on_action parameters for the elements above later in Step 6, so there is no need to worry about them for now.

After that, we define the output segment, where the frequency table and chart of the keywords based on the input parameters will be displayed.

Output segment of the Keyword Analysis page | Image by author

We will define the chart properties in addition to specifying the Markdown of the output segment in the variable output_page.

<script src="https://gist.github.com/kennethleungty/ce6a7c3dbefc93147ed69f1441ae95db.js"></script>

And in the last line above, we combine both input and output segments into a single variable called analysis_page.


(5.3) Main Landing Page

One last bit before our frontend interface is complete. Now that we have both pages ready, we shall display them on our main landing page.

The main page is defined within main.py, which is the script that will be run when the application is launched. The aim is to create a functional menu bar on the main page for users to toggle between the pages.

<script src="https://gist.github.com/kennethleungty/803237bc47fa1c2699228240d63c66a0.js"></script>

From the above code, we can see the state functionality of Taipy in action, where the page is rendered based on the selected page in the session state.


Step 6 – Linking Backend and Frontend with Scenarios

At this point, our frontend interface and backend pipeline have been set up successfully. However, we have yet to link both of them together.

More specifically, we will need to create the Scenarios component so that variations in the input parameters are processed in the pipeline, and the output is reflected in the dashboard.

The added benefit of Scenarios is that every input-output set can be saved so that users can refer back to these previous configurations.

We will define four functions to set up the Scenarios component, which will be stored in the analysis_md.py script:

(6.1) Update Chart

This function updates the keywords DataFrame, frequency count table, and corresponding bar chart based on the input parameters of the selected Scenario stored in the session state.

<script src="https://gist.github.com/kennethleungty/d39fd0a0a366ff9b437e43da289eb635.js"></script>

(6.2) Submit Scenario

This function registers the updated set of input parameters the user has modified as a scenario and passes the values through the pipeline.

<script src="https://gist.github.com/kennethleungty/5d6c841cc39cc8208f488ca72f1cf90a.js"></script>

(6.3) Create Scenario

This function saves a scenario that has been executed so that it can be easily recreated and referred to again from the dropdown menu of created Scenarios.

<script src="https://gist.github.com/kennethleungty/969d5ebb7aadd4ac13001d6d240b3e14.js"></script>

(6.4) Synchronize GUI and Core

This function retrieves input parameters from a Scenario selected from the dropdown menu of saved Scenarios and displays the resulting output in the frontend GUI.

<script src="https://gist.github.com/kennethleungty/41aafda3ad3c32b96b5048955d174071.js"></script>
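The four functions above share one underlying pattern: copy values from the GUI state into the Scenario's Data Nodes, run the Scenario, then copy the results back into the state. The sketch below illustrates that round trip under some assumptions: the helper names are invented, the Scenario object is assumed to be stored in `state.scenario`, and the Data Node ids are assumed to match the names of the GUI-bound variables. `write()`, `read()`, and `tp.submit()` are Taipy Core calls.

```python
def write_inputs(state, scenario, names):
    """Copy each GUI-bound value from the session state into its Data Node."""
    for name in names:
        getattr(scenario, name).write(getattr(state, name))


def read_outputs(state, scenario, names):
    """Copy each output Data Node back into the session state for display."""
    for name in names:
        setattr(state, name, getattr(scenario, name).read())


def submit_and_refresh(state, input_names, output_names):
    """Run the selected Scenario with the current inputs and refresh the GUI."""
    # Imported here so the two helpers above work without Taipy installed.
    import taipy as tp

    write_inputs(state, state.scenario, input_names)
    tp.submit(state.scenario)
    read_outputs(state, state.scenario, output_names)
```

Assigning to attributes of `state` is what prompts Taipy GUI to re-render the bound tables and charts.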

Step 7 – Launching the Application

In the last step, we wrap up by completing the code in main.py so that the Taipy application launches and runs correctly when the script is executed.

<script src="https://gist.github.com/kennethleungty/bb362969b66d9cec906b479400638249.js"></script>

The above code does the following steps:

  • Instantiate Taipy Core
  • Setup scenario creation and execution
  • Retrieve keywords DataFrame and frequency count table
  • Launch Taipy GUI (with the specified pages)

Finally, we can run python main.py in the Command Line, and the application we have built will be accessible on localhost:8020.

Frontend interface of completed application | Image by author

(4) Wrapping it up

The keywords associated with a document offer concise and comprehensive indications of its subject matter, highlighting the most important themes, concepts, ideas, or arguments contained therein.

In this article, we explored how to extract and analyze keywords of arXiv abstracts using KeyBERT and Taipy. We also discovered how to deliver these capabilities as a web application comprising a frontend user interface and a backend pipeline.

Feel free to check out the code in the accompanying GitHub repo.

Before you go

I welcome you to join me on a journey of data science discovery! Follow this Medium page and visit my GitHub to stay updated with more engaging and practical content. Meanwhile, have fun building your keyword extraction and analysis pipeline with KeyBERT and Taipy!
