How to automate entity extraction from PDF using LLMs

Author:Murphy  |  View: 29901  |  Time: 2025-03-23 18:21:40
Photo by Google DeepMind on Unsplash

The need for high-quality labeled data cannot be overstated in modern machine learning applications. From improving our models' performance to ensuring fairness, the power of labeled data is immense. Unfortunately, the time and effort required to create such datasets are equally significant. But what if we could reduce the time spent on this task from days to mere hours while maintaining or even enhancing the labeling quality? A utopian dream? Not anymore.

Emerging paradigms in machine learning – Zero-Shot Learning, Few-Shot Learning, and Model-Assisted Labeling – present a transformative approach to this crucial process. These techniques harness the power of advanced algorithms, reducing the need for extensive labeled datasets, and enabling faster, more efficient, and highly effective data annotation.

In this tutorial, we present a method to auto-label unstructured and semi-structured documents using the in-context learning capabilities of Large Language Models (LLMs).

Information extraction from SDS

Unlike traditional supervised models, which need extensive labeled data to be trained for a specific task, LLMs can generalize and extrapolate from a few examples by tapping into their large knowledge base. This emergent capability, known as in-context learning, makes LLMs a versatile choice for many tasks, including not only text generation but also data extraction tasks such as named entity recognition.

For this tutorial, we are going to label Safety Data Sheets (SDS) from various companies using the zero-shot and few-shot labeling capabilities of GPT-3.5, the model behind ChatGPT. An SDS provides comprehensive information about a specific substance or mixture and is designed to help workplaces manage chemicals effectively. These documents detail hazards, including environmental risks, and give guidance on safety precautions, so that employees can make informed decisions about handling and using chemicals safely. SDSs usually come as PDFs in various layouts but generally contain the same information. In this tutorial, we want an AI model that automatically identifies the following entities:

  • Product number
  • CAS number
  • Use cases
  • Classification
  • GHS label
  • Formula
  • Molecular weight
  • Synonym
  • Emergency phone number
  • First aid measures
  • Component
  • Brand

Extracting this information and storing it in a searchable database is valuable for many companies because it makes hazardous components quick to search for and retrieve. Here is an example of an SDS:

Publicly available SDS. Image by Author
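
The searchable database mentioned above can be quite small. The following is a minimal sketch using SQLite from the Python standard library; the table layout, the function names (create_store, add_entities, find_documents) and the sample values are assumptions made for illustration, not part of any particular product.

```python
import sqlite3


def create_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small store with one row per extracted entity."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities ("
        "document TEXT NOT NULL, label TEXT NOT NULL, value TEXT NOT NULL)"
    )
    # Index the columns used for lookups so searches stay fast as data grows.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_label_value ON entities (label, value)"
    )
    return conn


def add_entities(conn: sqlite3.Connection, document: str,
                 entities: list[tuple[str, str]]) -> None:
    """Store (label, value) pairs extracted from one document."""
    conn.executemany(
        "INSERT INTO entities (document, label, value) VALUES (?, ?, ?)",
        [(document, label, value) for label, value in entities],
    )
    conn.commit()


def find_documents(conn: sqlite3.Connection, label: str, value: str) -> list[str]:
    """Return the documents that contain the given entity value."""
    rows = conn.execute(
        "SELECT DISTINCT document FROM entities "
        "WHERE label = ? AND value = ? ORDER BY document",
        (label, value),
    )
    return [row[0] for row in rows]


# Made-up sample data, for illustration only.
conn = create_store()
add_entities(conn, "acetone.pdf", [("CAS number", "67-64-1"), ("Brand", "ExampleBrand")])
add_entities(conn, "ethanol.pdf", [("CAS number", "64-17-5")])
print(find_documents(conn, "CAS number", "67-64-1"))
```

Keeping one row per entity, instead of one column per entity type, means that new entity types can be added later without changing the schema.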

Zero-shot Labeling

Unlike text generation, information extraction is a much more challenging task for LLMs. They have been trained for text completion and tend to hallucinate or to generate additional comments or text when prompted to extract information.
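
For entities with a checkable format, a cheap guard against hallucinated values is to validate them after extraction. CAS numbers are a good candidate because the last digit is a check digit computed from the others. The sketch below is an illustration added here, not something the labeling tool is known to do, and the function name is_valid_cas is hypothetical.

```python
import re

# A CAS number has 2 to 7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(candidate: str) -> bool:
    """Return True if the string is well formed and its check digit matches."""
    match = CAS_PATTERN.match(candidate.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("7732-18-5"))  # CAS number of water
print(is_valid_cas("7732-18-4"))  # same number with a wrong check digit
```

A value that fails this check can be flagged for human review instead of being written to the dataset.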

To parse the LLM's result correctly, we need consistent output, such as JSON, which takes some prompt engineering to get right. In addition, once the results are parsed, we need to map them back to the original tokens in the input text.
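
The two steps above, parsing the output and mapping it back to the input, can be sketched as follows. This is a simplified illustration under stated assumptions: the model is asked to return a JSON array of objects with "label" and "text" keys, and entities are located by exact string match, which yields character offsets that a tokenizer can then align to tokens. The function names parse_entities and locate_entities are hypothetical.

```python
import json
import re


def parse_entities(llm_output: str) -> list[dict]:
    """Pull the JSON array out of the model output, ignoring extra commentary."""
    match = re.search(r"\[.*\]", llm_output, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only well-formed items with a non-empty string value.
    return [
        item for item in items
        if isinstance(item, dict)
        and isinstance(item.get("label"), str)
        and isinstance(item.get("text"), str)
        and item["text"]
    ]


def locate_entities(document: str, entities: list[dict]) -> list[dict]:
    """Attach character offsets; values not found verbatim are dropped."""
    spans = []
    for entity in entities:
        for found in re.finditer(re.escape(entity["text"]), document):
            spans.append({
                "label": entity["label"],
                "text": entity["text"],
                "start": found.start(),
                "end": found.end(),
            })
    return spans


# Made-up document and model output, for illustration only.
document = "Product name : Acetone\nCAS-No. : 67-64-1\nBrand : ExampleBrand"
llm_output = (
    "Here is the result:\n"
    '[{"label": "CAS number", "text": "67-64-1"},'
    ' {"label": "Brand", "text": "OtherBrand"}]'
)
for span in locate_entities(document, parse_entities(llm_output)):
    print(span)
```

Dropping values that do not occur verbatim in the document also filters out one common kind of hallucination: here the "OtherBrand" entity is discarded because the text never mentions it.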

Fortunately, all of these steps have been done and abstracted away by the UBIAI annotation tool. Under the hood, UBIAI does the prompting, chunks the data so that it stays below the context length limit, and sends it to OpenAI's GPT-3.5 Turbo API for inference. Once the output is sent back, the data is parsed, processed, and applied to your documents for auto-labeling.
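
To give a rough idea of what the chunking step involves, here is a minimal sketch. It is not UBIAI's implementation: it uses the word count as a crude stand-in for the token count (a real pipeline would count tokens with the model's tokenizer), and the function name chunk_text and the max_words parameter are assumptions.

```python
def chunk_text(text: str, max_words: int = 1500) -> list[str]:
    """Split text on line boundaries into chunks of roughly max_words words.

    A single line longer than max_words is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for line in text.splitlines():
        words = len(line.split())
        # Start a new chunk when adding this line would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks


# Ten short lines with a budget of six words give five chunks of two lines.
sample = "\n".join(["one two three"] * 10)
print(len(chunk_text(sample, max_words=6)))
```

Breaking on line boundaries keeps key-value rows such as "CAS-No. : 67-64-1" intact, which matters for SDS layouts where an entity and its label often share a line.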

To get started, simply upload your documents, whether they are native PDFs, images, or simple DOCX files, then go to the annotation page and select the Few-shot tab in the annotation interface:

UBIAI Few-shot dashboard. Image by Author

For more details, check out the documentation here: https://ubiai.gitbook.io/ubiai-documentation/zero-shot-and-few-shot-labeling

UBIAI lets you configure the number of examples that the model learns from when auto-labeling the next documents. The app automatically chooses the most informative documents from your already labeled dataset and concatenates them in the prompt. This approach is called few-shot labeling, where "few" ranges from 0 to n. To configure the number of examples, simply click the configuration button and enter the number, as shown below.
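
Conceptually, the prompt assembly described above could look like the sketch below. How UBIAI actually ranks documents as "most informative" is not documented here, so pick_examples uses a stand-in heuristic (prefer documents that cover the most distinct labels); that heuristic, the function names and the prompt layout are all assumptions.

```python
import json


def pick_examples(labeled_docs: list[dict], k: int) -> list[dict]:
    """Pick k labeled documents, preferring broad label coverage.

    Stand-in heuristic only; the real selection criterion may differ.
    """
    ranked = sorted(
        labeled_docs,
        key=lambda doc: len({entity["label"] for entity in doc["entities"]}),
        reverse=True,
    )
    return ranked[:k]


def build_few_shot_prompt(instructions: str, examples: list[dict],
                          new_text: str) -> str:
    """Concatenate the instructions, the worked examples and the new text."""
    parts = [instructions]
    for example in examples:
        parts.append("Text:\n" + example["text"])
        parts.append("Entities:\n" + json.dumps(example["entities"]))
    # The prompt ends where the model is expected to continue.
    parts.append("Text:\n" + new_text)
    parts.append("Entities:")
    return "\n\n".join(parts)


# Made-up labeled documents, for illustration only.
labeled = [
    {"text": "Brand : ExampleBrand",
     "entities": [{"label": "Brand", "text": "ExampleBrand"}]},
    {"text": "Formula : C3H6O\nCAS-No. : 67-64-1",
     "entities": [{"label": "Formula", "text": "C3H6O"},
                  {"label": "CAS number", "text": "67-64-1"}]},
]
print(build_few_shot_prompt(
    "Extract the entities as a JSON array.",
    pick_examples(labeled, k=1),
    "CAS-No. : 64-17-5",
))
```

With k set to 0, no examples are included and the same function produces a zero-shot prompt.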

UBIAI Few-shot configuration window. Image by Author

For this tutorial, we are going to provide zero examples to the LLM to learn from and ask it to label the data based purely on the description of the entity itself. Surprisingly, the LLM is able to understand our document quite well and does most of the labeling correctly!
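
To make "based purely on the description of the entity" concrete, here is a sketch of how a zero-shot prompt could be built from entity descriptions. The descriptions, the prompt wording and the function name build_zero_shot_prompt are illustrative assumptions, not the prompt the tool actually sends.

```python
# Illustrative descriptions for a few of the entities; the wording is assumed.
ENTITY_DESCRIPTIONS = {
    "Product number": "Supplier's catalog or product identifier",
    "CAS number": "Chemical Abstracts Service registry number",
    "Formula": "Chemical formula of the substance",
    "Emergency phone number": "Phone number to call in an emergency",
}


def build_zero_shot_prompt(descriptions: dict[str, str], text: str) -> str:
    """Describe each entity, ask for JSON only, then append the document text."""
    lines = [
        "Extract the entities described below from the text.",
        'Answer with a JSON array of {"label": ..., "text": ...} objects '
        "and nothing else.",
        "Copy each value exactly as it appears in the text.",
        "",
        "Entities:",
    ]
    lines += [
        f"- {label}: {description}"
        for label, description in descriptions.items()
    ]
    lines += ["", "Text:", text]
    return "\n".join(lines)


print(build_zero_shot_prompt(ENTITY_DESCRIPTIONS, "CAS-No. : 67-64-1"))
```

Asking the model to copy values exactly and to return nothing but JSON addresses the two problems noted earlier: extra commentary in the output and values that cannot be matched back to the source text.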

Below is the result of zero-shot labeling on the SDS PDF without any examples. Quite impressive!

Zero-shot labeling using UBIAI. Image by Author

Conclusion

Automating entity extraction from PDFs has become a reality with the advent of LLMs' in-context learning capabilities, such as zero-shot and few-shot learning. These techniques harness LLMs' latent knowledge to reduce the reliance on extensive labeled datasets and enable faster, more efficient, and highly effective data annotation.

The tutorial presented a method to auto-label semi-structured documents, focusing specifically on Safety Data Sheets (SDS), that would also work for unstructured text. By leveraging the in-context learning capabilities of LLMs, particularly GPT-3.5 (ChatGPT), the tutorial demonstrated the ability to automatically identify important entities within SDSs, such as product number, CAS number, use cases, classification, GHS label, and more.

The extracted information, if stored in a searchable database, provides significant value to companies as it allows for quick search and retrieval of hazardous components. The tutorial highlighted the potential of zero-shot labeling, where the LLM can understand and extract information from SDSs without any explicit examples. This showcases the versatility and generalization abilities of LLMs, going beyond text generation tasks.

If you are interested in creating your own training dataset using LLMs' zero-shot capabilities, schedule a demo with us here.

Follow us on Twitter @UBIAI5!

