arXiv Keyword Extraction and Analysis Pipeline with KeyBERT and Taipy

As the amount of textual data from sources like social media, customer reviews, and online platforms grows exponentially, we must be able to make sense of this unstructured data.
Keyword extraction and analysis are powerful natural language processing (NLP) techniques that enable us to achieve that.
Keyword extraction involves automatically identifying and extracting the most relevant words from a given text, while keyword analysis involves analyzing the keywords to gain insights into the underlying patterns.
In this step-by-step guide, we explore building a keyword extraction and analysis pipeline and web app on arXiv abstracts using the powerful tools of KeyBERT and Taipy.
Contents
- (1) Context
- (2) Tools Overview
- (3) Step-by-Step Guide
- (4) Wrapping it up
Here is the accompanying GitHub repo for this article.
(1) Context
Given the rapid progress in artificial intelligence (AI) and machine learning research, keeping track of the many papers published daily can be challenging.
For such research, arXiv is undoubtedly one of the leading sources of information. arXiv (pronounced 'archive') is an open-access archive hosting a vast collection of scientific papers covering various disciplines like computer science, mathematics, and more.

One of the key features of arXiv is that it provides abstracts for each paper uploaded to its platform. These abstracts are an ideal data source as they are concise, rich in technical vocabulary, and contain domain-specific terminology.
Hence, we will utilize the latest batches of arXiv abstracts as the text data to work on in this project.
The goal is to create a web application (comprising a frontend interface and backend pipeline) where users can view the keywords and key phrases of arXiv abstracts based on specific input values.

(2) Tools Overview
There are three main tools that we will use in this project:
- arXiv API Python wrapper
- KeyBERT
- Taipy
(i) arXiv API Python wrapper
The arXiv website offers public API access to maximize its openness and interoperability. To retrieve the text abstracts as part of our Python workflow, we can use the Python wrapper for the arXiv API.
The arXiv API Python wrapper provides a set of functions for searching the database for papers that match specific criteria, such as author, keyword, category, and more.
It also lets users retrieve detailed metadata about each paper, such as the title, abstract, authors, and publication date.
(ii) KeyBERT
KeyBERT (from the terms 'keyword' and 'BERT') is a Python library that provides an easy-to-use interface for using BERT embeddings and cosine similarity to extract the words in a document most representative of the document itself.

The biggest strength of KeyBERT is its flexibility. It allows users to easily modify the underlying settings (e.g., parameters, embeddings, tokenization) to experiment and fine-tune the keywords obtained.
In this project, we will be tuning the following set of parameters:
- Number of the top keywords to be returned
- Word n-gram range (i.e., minimum and maximum n-gram length)
- Diversification algorithm (Max Sum Distance or Maximal Marginal Relevance) that controls how redundancy among the extracted keywords is reduced
- Number of candidates (if Max Sum Distance is set)
- Diversity value (if Maximal Marginal Relevance is set)
Both diversification algorithms (Max Sum Distance and Maximal Marginal Relevance) share the same basic idea of balancing two objectives: retrieving results that are highly relevant to the query (here, the document) while keeping them diverse enough to avoid redundancy among the results.
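To make this trade-off concrete, here is a minimal, dependency-free sketch of Maximal Marginal Relevance. It is not KeyBERT's own implementation: the function names (`cosine`, `mmr_select`) and the toy two-dimensional vectors are assumptions made up for illustration. In each round, the sketch picks the candidate with the best score of `(1 - diversity) * relevance - diversity * redundancy`, where redundancy is the highest similarity to any keyword already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(doc_vec, candidates, top_n=2, diversity=0.5):
    """Pick top_n candidate names, trading relevance against redundancy.

    candidates maps a keyword (str) to its embedding (list of floats).
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:

        def score(name):
            relevance = cosine(remaining[name], doc_vec)
            # Highest similarity to anything already chosen (0 if none yet)
            redundancy = max(
                (cosine(remaining[name], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy embeddings (made up for illustration only)
doc = [1.0, 1.0]
cands = {
    "neural network": [1.0, 0.9],
    "neural networks": [1.0, 0.8],
    "graph theory": [0.2, 1.0],
}
low_diversity = mmr_select(doc, cands, top_n=2, diversity=0.0)
high_diversity = mmr_select(doc, cands, top_n=2, diversity=0.7)
```

With `diversity=0.0` the two near-duplicate phrases are both returned, whereas a higher diversity value swaps the second one for the less similar "graph theory".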
(iii) Taipy
Taipy is an open-source Python application builder that lets developers and data scientists quickly turn data and machine learning algorithms into complete web applications.
While designed to be a low-code library, Taipy also provides a high level of user customization. Therefore, it is well-suited for wide-ranging use cases, from simple dashboarding to production-ready industrial applications.

There are two key components of Taipy: Taipy GUI and Taipy Core.
- Taipy GUI: A simple graphical user interface builder enabling us to easily create an interactive frontend app interface.
- Taipy Core: A modern backend framework that lets us efficiently build and execute pipelines and scenarios.
While we can use Taipy GUI or Taipy Core independently, combining both allows us to build powerful applications efficiently.
(3) Step-by-Step Guide
As mentioned earlier in the Context section, we will build a web app that extracts and analyzes keywords of selected arXiv abstracts.
The following diagram illustrates how the data and tools are integrated.

Let us get started with the steps to create the above pipeline and web application in Python.
Step 1 – Initial Setup
We start by pip installing the necessary Python libraries with corresponding versions shown below:
Step 2 – Setup Configuration File
As numerous parameters will be used, saving them inside a separate configuration file is ideal. The following YAML file, `config.yml`, contains the initial set of configuration parameter values.
With the configuration file set up, we can then easily import these parameter values into our other Python scripts with the following code:
    import yaml

    # Load the configuration parameters into a dictionary
    with open('config.yml') as f:
        cfg = yaml.safe_load(f)
Step 3 – Build Functions
In this step, we will create a series of Python functions that form vital components of the pipeline. We create a new Python file, `functions.py`, to store these functions.
(3.1) Retrieve and Save arXiv Abstracts and Metadata
The first function to add into `functions.py` is one for retrieving text abstracts from the arXiv database using the arXiv API Python wrapper.
Next, we write a function to store the abstract texts and corresponding metadata in a pandas DataFrame.
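A minimal sketch of these two steps might look like the following. The function names, the column names, and the default query `cat:cs.AI` are assumptions for illustration, not the article's original code. The `arxiv` package is imported lazily so that the row-building helper can be tried offline with a stand-in object.

```python
from types import SimpleNamespace


def results_to_rows(results):
    """Flatten arXiv result objects into plain dicts (one per paper)."""
    rows = []
    for r in results:
        rows.append({
            "title": r.title,
            # Abstracts often contain hard line breaks; join them into one line
            "abstract": r.summary.replace("\n", " "),
            "date_published": r.published,
            "url": r.entry_id,
        })
    return rows


def fetch_arxiv_rows(query="cat:cs.AI", max_results=10):
    """Query arXiv for the newest papers (needs the arxiv package and network)."""
    import arxiv  # imported lazily so the helper above works without it

    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
    )
    return results_to_rows(arxiv.Client().results(search))


# Offline demonstration with a stand-in object (placeholder values, not a real paper)
fake = SimpleNamespace(
    title="A Sample Paper",
    summary="We study\nkeyword extraction.",
    published="2023-05-01",
    entry_id="http://arxiv.org/abs/0000.00000",
)
rows = results_to_rows([fake])
# pandas.DataFrame(rows) would turn these rows into the DataFrame used downstream
```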
(3.2) Process Data
For the data processing step, we have the following function to parse the abstract publication date into the appropriate format while creating new empty columns to store keywords.
(3.3) Run KeyBERT
We next create a function to run the `KeyBERT` class from the KeyBERT library. The `KeyBERT` class is a minimal method for keyword extraction with BERT and is the easiest way for us to get started.
There are many different methods for generating the BERT embeddings (e.g., Flair, Huggingface Transformers, and spaCy). In this case, we will use sentence-transformers as recommended by the KeyBERT creator.
In particular, we will use the default [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model as it provides a good balance of speed and quality.
The following function extracts the keywords from each abstract iteratively and saves them in the new DataFrame columns created in the previous step.
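As a hedged sketch of how the tunable parameters could map onto KeyBERT's `extract_keywords()` arguments, consider the helper below. The helper names and the `"maxsum"` / `"mmr"` labels are assumptions for illustration; the keyword arguments themselves (`keyphrase_ngram_range`, `top_n`, `use_maxsum`, `nr_candidates`, `use_mmr`, `diversity`) are the ones KeyBERT exposes. KeyBERT is imported lazily, so the argument-building part can be inspected without downloading a model.

```python
def build_keybert_kwargs(top_n, ngram_min, ngram_max, diversity_algo,
                         nr_candidates=20, diversity=0.5):
    """Translate the tunable parameters into extract_keywords() arguments."""
    kwargs = {
        "keyphrase_ngram_range": (ngram_min, ngram_max),
        "top_n": top_n,
        "stop_words": "english",
    }
    if diversity_algo == "maxsum":
        # Max Sum Distance needs the size of the candidate pool
        kwargs.update(use_maxsum=True, nr_candidates=nr_candidates)
    elif diversity_algo == "mmr":
        # Maximal Marginal Relevance needs a diversity value between 0 and 1
        kwargs.update(use_mmr=True, diversity=diversity)
    else:
        raise ValueError(f"Unknown diversification algorithm: {diversity_algo}")
    return kwargs


def extract_keywords(abstracts, **params):
    """Return one list of (keyword, score) tuples per abstract."""
    from keybert import KeyBERT  # lazy import: needs the keybert package

    model = KeyBERT(model="all-MiniLM-L6-v2")
    kwargs = build_keybert_kwargs(**params)
    return [model.extract_keywords(text, **kwargs) for text in abstracts]


mmr_kwargs = build_keybert_kwargs(5, 1, 2, "mmr", diversity=0.7)
maxsum_kwargs = build_keybert_kwargs(3, 1, 1, "maxsum", nr_candidates=15)
```

Keeping the argument mapping in its own function means only one place has to change when a new parameter is exposed in the dashboard.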
(3.4) Get Keywords Value Counts
Finally, we create a function that generates a value count of the keywords so that we can plot the keyword frequencies in a chart later.
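The counting logic can be sketched without any dependencies using `collections.Counter`. This is a stand-in for a pandas value count, not the article's original function; the function name and the lowercasing step are assumptions for illustration.

```python
from collections import Counter


def keyword_value_counts(keywords_per_abstract):
    """Count how often each keyword appears across all abstracts."""
    counts = Counter()
    for keywords in keywords_per_abstract:
        # Lowercase so that "BERT" and "bert" are counted together
        counts.update(k.lower() for k in keywords)
    # most_common() sorts by descending frequency
    return counts.most_common()


sample = [["BERT", "keyword extraction"], ["bert", "embeddings"]]
counts = keyword_value_counts(sample)
```

The resulting list of `(keyword, count)` pairs is already sorted by frequency, which is the order a bar chart of top keywords would typically use.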
Step 4 – Setup Taipy Core: Backend Config
To orchestrate and link the backend pipeline flow, we will leverage the capabilities of Taipy Core.
Taipy Core offers an open-source framework to create, manage, and execute our data pipelines easily and efficiently. It has four fundamental concepts: Data Nodes, Tasks, Pipelines, and Scenarios.

To set up the backend, we will use configuration objects (from the `Config` class) to model and define the characteristics and desired behavior of the abovementioned concepts.
(4.1) Data Nodes
As with most Data Science projects, we start by handling the data. In Taipy Core, we use Data Nodes to define the data we will work with.
We can think of Data Nodes as Taipy's representation of data variables. However, instead of storing the data directly, Data Nodes contain a set of instructions on how to retrieve the data needed.
Data Nodes can read and write a wide range of data types, such as Python objects (e.g., `str`, `int`, `list`, `dict`, `DataFrame`, etc.), Pickle files, CSVs, SQL databases, and more.
Using the `Config.configure_data_node()` function, we define the Data Nodes for the keyword parameters based on the values from the configuration file in Step 2.
The `id` parameter sets the name of the Data Node, while the `default_data` parameter defines the default values.
We next include the configuration objects for the five sets of data along the pipeline, as illustrated below:

The following code defines the five configuration objects:
(4.2) Tasks
Tasks in Taipy can be thought of as Python functions. We can define the configuration object for Tasks using the `Config.configure_task()` function.
We need to set five Task configuration objects corresponding to the five functions built in Step 3.

The `input` and `output` parameters refer to the input and output Data Nodes, respectively.
For example, in `task_process_data_cfg`, the input is the Data Node for the raw pandas DataFrame containing the arXiv search results, while the output is the Data Node for the DataFrame storing processed data.
The `skippable` parameter, when set to True, indicates that the Task can be skipped if no changes have been made to the inputs.
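A minimal sketch of one parameter Data Node and one Task configuration is shown below. The Data Node ids, the default value, and the placeholder `process_data` function are assumptions for illustration rather than the article's original code. Taipy is imported inside the function so that the file can be loaded even where Taipy is not installed.

```python
def process_data(df_raw):
    """Placeholder Task function; the real one would clean the DataFrame."""
    return df_raw


def configure_backend(default_top_n=5):
    """Build Data Node and Task configs (needs the taipy package)."""
    from taipy.config import Config  # lazy import so this file loads without Taipy

    # A parameter Data Node with a default value (illustrative, not from config.yml)
    top_n_cfg = Config.configure_data_node(id="top_n", default_data=default_top_n)

    # Data Nodes for the Task's input and output
    df_raw_cfg = Config.configure_data_node(id="df_raw")
    df_processed_cfg = Config.configure_data_node(id="df_processed")

    # The Task links a Python function to its input and output Data Nodes
    task_process_data_cfg = Config.configure_task(
        id="task_process_data",
        function=process_data,
        input=[df_raw_cfg],
        output=[df_processed_cfg],
        skippable=True,  # skip re-running when the inputs are unchanged
    )
    return top_n_cfg, task_process_data_cfg
```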
Here is the flowchart of the Data Nodes and Tasks we have defined so far:

(4.3) Pipelines
A Pipeline is a series of Tasks that will be executed automatically by Taipy. It is a configuration object comprising a sequence of Task configuration objects.
In this case, we will allocate the five Tasks into two Pipelines (one for data preparation and one for keyword analysis) as illustrated below:

We use the following code to define our two Pipeline configs:
As with all configuration objects, we assign a name to these Pipeline configurations using the `id` parameter.
(4.4) Scenarios
In this project, we aim to create an application that reflects the updated set of keywords (and corresponding analysis) based on changes made to input parameters (e.g., n-gram length).
For that to happen, we leverage the powerful concept of Scenarios. Taipy Scenarios provide the framework for running Pipelines under different conditions, such as when the user modifies the input parameters or data.
Scenarios also allow us to save the outputs from the different inputs for easy comparison within the same app interface.
Since we expect to do a straightforward sequential run of the Pipelines, we can place both Pipeline configs into a single Scenario configuration object.
Step 5 – Setup Taipy GUI (Frontend)
Let us now switch gears and explore the frontend aspects of our application. Taipy GUI provides Python classes that make it easy to create powerful web app interfaces with text and graphical elements.
Pages are the basis for the user interface, and they hold text, images, or controls that display information in the application through visual elements.
There are two pages to create: (i) a keyword analysis dashboard page and (ii) a data viewer page to display the keywords DataFrame.
(5.1) Data Viewer
Taipy GUI can be considered an augmented Markdown, meaning we can use the Markdown syntax to build our frontend interface.
We start with the simple frontend page displaying the DataFrame of the extracted arXiv abstract data. The page is set up in a Python script (named `data_viewer_md.py`), with the Markdown stored in a variable (called `data_page`).
The basic syntax for creating Taipy constructs in Markdown is using text fragments in the generic format of `<|...|...|>`.
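A page of this kind might look like the following sketch. The page heading is an assumption for illustration; the `<|{df}|table|>` fragment follows the generic format just described, with the variable name `df` and the variable name `data_page` taken from the text.

```python
# Taipy GUI page written in augmented Markdown.
# "<|{df}|table|>" binds the Python variable df to a table element.
data_page = """
# Extracted arXiv Data

<|{df}|table|>
"""

# To serve the page with Taipy GUI (needs the taipy package), with a
# DataFrame named df defined in the same module:
#   from taipy.gui import Gui
#   Gui(page=data_page).run()
```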
In the above Markdown, we pass our DataFrame object `df` along with `table`, which indicates a table element. With just these few lines of code, we get an output like the following:

(5.2) Keyword Analysis Dashboard
We now move to the main dashboard page of the application, where we can make changes to the parameters and visualize the keywords obtained. The visual elements will be contained within a Python script (named `analysis_md.py`).
This page has numerous components, so let's take it one step at a time. First, we instantiate the parameter values upon the loading of the application.
Next, we define the input segment of the page where users can make changes to parameters and scenarios. This segment will be saved in a variable called `input_page`, and will eventually look like this:

We create a seven-column layout in the Markdown so that the input fields (e.g., text input, number input, dropdown menu selector) and buttons can be organized neatly.
We will explain the callback functions in the `on_change` and `on_action` parameters for the elements above later, so there is no need to worry about them for now.
After that, we define the output segment, where the frequency table and chart of the keywords based on the input parameters will be displayed.

We will define the chart properties in addition to specifying the Markdown of the output segment in the variable `output_page`.
And in the last line above, we combine both input and output segments into a single variable called `analysis_page`.
(5.3) Main Landing Page
One last bit before our frontend interface is complete. Now that we have both pages ready, we shall display them on our main landing page.
The main page is defined within `main.py`, which is the script that will be run when the application is launched. The aim is to create a functional menu bar on the main page for users to toggle between the pages.
From the above code, we can see the state functionality of Taipy in action, where the page is rendered based on the selected page in the session state.
Step 6 – Linking Backend and Frontend with Scenarios
At this point, our frontend interface and backend pipeline have been set up successfully. However, we have yet to link both of them together.
More specifically, we will need to create the Scenarios component so that variations in the input parameters are processed in the pipeline, and the output is reflected in the dashboard.
The added benefit of Scenarios is that every input-output set can be saved so that users can refer back to these previous configurations.
We will define four functions to set up the Scenarios component, which will be stored in the `analysis_md.py` script:
(6.1) Update Chart
This function updates the keywords DataFrame, frequency count table, and corresponding bar chart based on the input parameters of the selected Scenario stored in the session state.
(6.2) Submit Scenario
This function registers the updated set of input parameters the user has modified as a scenario and passes the values through the pipeline.
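A callback of this kind could be sketched as follows. The attribute names on `state` and on the scenario (`selected_scenario`, `top_n`, `diversity`, `df_keywords_count`) are hypothetical, and the `FakeNode` and `FakeScenario` classes are stand-ins that mimic only the `write()`, `read()`, and `submit()` methods used here, so the flow can be followed without a running Taipy application.

```python
from types import SimpleNamespace


def submit_scenario(state):
    """Write the GUI inputs into the Scenario, run it, and read the result."""
    scenario = state.selected_scenario

    # Push the user-selected parameters into the input Data Nodes
    scenario.top_n.write(state.top_n)
    scenario.diversity.write(state.diversity)

    # Execute the Scenario
    scenario.submit()

    # Pull the refreshed output back into the GUI state
    state.df_keywords_count = scenario.df_keywords_count.read()


# Offline stand-ins that mimic the few attributes used above
class FakeNode:
    def __init__(self, value=None):
        self.value = value

    def write(self, value):
        self.value = value

    def read(self):
        return self.value


class FakeScenario:
    def __init__(self):
        self.top_n = FakeNode()
        self.diversity = FakeNode()
        self.df_keywords_count = FakeNode()

    def submit(self):
        # Pretend the pipeline produced a result from the inputs
        self.df_keywords_count.write([("bert", self.top_n.read())])


state = SimpleNamespace(
    selected_scenario=FakeScenario(),
    top_n=3,
    diversity=0.5,
    df_keywords_count=None,
)
submit_scenario(state)
```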
(6.3) Create Scenario
This function saves a scenario that has been executed so that it can be easily recreated and referred to again from the dropdown menu of created Scenarios.
(6.4) Synchronize GUI and Core
This function retrieves input parameters from a Scenario selected from the dropdown menu of saved Scenarios and displays the resulting output in the frontend GUI.
Step 7 – Launching the Application
In the last step, we wrap up by completing the code in `main.py` so that Taipy launches and runs correctly when the script is executed.
The above code does the following steps:
- Instantiate Taipy Core
- Setup scenario creation and execution
- Retrieve keywords DataFrame and frequency count table
- Launch Taipy GUI (with the specified pages)
Finally, we can run `python main.py` in the command line, and the application we have built will be accessible on `localhost:8020`.

(4) Wrapping it up
The keywords associated with a document offer concise and comprehensive indications of its subject matter, highlighting the most important themes, concepts, ideas, or arguments contained therein.
In this article, we explored how to extract and analyze keywords of arXiv abstracts using KeyBERT and Taipy. We also discovered how to deliver these capabilities as a web application comprising a frontend user interface and a backend pipeline.
Feel free to check out the code in the accompanying GitHub repo.