Exposing Jailbreak Vulnerabilities in LLM Applications with ARTKIT

Author:Murphy  |  View: 22574  |  Time: 2025-03-22 20:22:49
Photo by Matthew Ball on Unsplash

As large language models (LLMs) become more widely adopted across different industries and domains, significant security risks have emerged and intensified. Several of these key concerns include breaches of data privacy, the potential for biases, and the risk of information manipulation.

The Open Worldwide Application Security Project recently published the ten most critical security risks of LLM applications, as described below:

Identifying these risks is essential for ensuring that LLM applications continue to provide value in real-world situations while maintaining their safety, effectiveness, and robustness.

In this article, we explore how to use the open-source ARTKIT framework to automatically evaluate security vulnerabilities of LLM applications using the popular Gandalf Challenge **** as an illustrative example.

Contents

(1) About Prompt Injection Vulnerabilities (2) Gandalf Challenge (3) Introducing ARTKIT (4) Approach Overview (5) Step-by-Step Walkthrough

You can find the codes in the accompanying GitHub repo of this article.


(1) About Prompt Injection Vulnerabilities

A prompt injection vulnerability is a type of cyberattack that arises when attackers exploit an LLM with carefully crafted inputs, leading it to execute malicious instructions unintentionally.

Prompt injection attacks can be difficult to detect and can lead to serious consequences such as leakage of sensitive information, unauthorized access, and manipulation of decision-making processes.

It can be performed directly or indirectly:

(i) Direct (Jailbreaking)

  • Attackers directly modify underlying system prompts, thereby convincing the LLM system to disregard its safeguards. It allows attackers to generate harmful responses or exploit backend systems by interacting with insecure functions and data stores.
  • Example: An attacker crafts prompts instructing the LLM to ignore the instructions in the original system prompts and instead return private information like passwords.

(ii) Indirect

  • Attackers manipulate external inputs (e.g., files, websites) that the LLM ingests, allowing them to control the LLM's responses and actions, even if the injected text is invisible to users.
  • Example: An attacker uploads a document embedded with prompts—concealed in a zero-point font—instructing LLMs to evaluate the user's resume as that of an exceptional job candidate. When the recruiter uses an LLM to assess the resume, it inadvertently showcases the candidate as highly qualified, thereby skewing the hiring process.

There are different types of prompt injection attacks, such as virtualization, obfuscation, and role-playing attacks. Details can be found here.


(2) Gandalf Challenge

In this project, we attempt to automatically crack the Gandalf Challenge, an interactive game that demonstrates the security vulnerabilities of LLM applications and highlights mitigation strategies.

Screenshot from publicly accessible Gandalf website, utilized under fair use

The objective of the game is simple: use prompt engineering techniques to trick the LLM behind Gandalf's interface to reveal a password.

The game consists of ten progressively difficult levels based on various defenses employed to prevent password disclosure, such as prompts instructing not to reveal the password, input guardrails that filter user prompts, and output guardrails that block responses containing passwords.


(3) Introducing ARTKIT

As LLM systems become more commonplace, it is important to build user trust by ensuring that the models perform reliably even in adversarial conditions. This is where ARTKIT comes in handy in testing LLM systems for their proficiency, equitability, safety, and security.

ARTKIT is an open-source framework for developing powerful automated end-to-end pipelines to test and evaluate LLM-based applications like chatbots and virtual assistants.

Its simplicity and flexibility in building fit-for-purpose pipelines make it an excellent tool for data scientists and engineers to conduct human-in-the-loop testing of LLM systems.

For example, ARTKIT facilitates the effective use of LLMs to automate critical steps in red-teaming, such as generating adversarial prompts to exploit LLMs and analyzing their responses to uncover potential vulnerabilities.

The structured process of simulating attacks on a system to uncover its vulnerabilities and improve security is known as red-teaming. It allows organizations to strengthen their defenses against real-world threats by understanding potential breaches from an attacker's perspective.

ARTKIT allows for the clever use of Generative AI (GenAI) models like LLMs as part of powerful pipelines to automate testing and evaluation of GenAI systems | Image used under Apache License 2.0

One of ARTKIT's standout features is its support for automated multi-turn conversations between an attacker system and a target system, which we will explore in this article.

Given that LLM systems may struggle to maintain context and coherence over prolonged chats, the ability to scale the testing of extended, multi-turn interactions is crucial in identifying potential vulnerabilities.


(4) Approach Overview

Here is an overview of our approach to demonstrating LLM jailbreaking:

  • Conduct model-based red-teaming, where we use an LLM model to attack a target system (i.e., Gandalf)
  • Utilize ARTKIT and OpenAI's GPT-4o to create an attacker LLM that employs password extraction techniques in its adversarial prompts while engaging in multi-turn conversations until the password is divulged.

OpenAI API keys can be found on the API key page.


(5) Step-by-Step Walkthrough

Let's review the steps for using ARTKIT to extract passwords in the Gandalf challenge. You can find the accompanying Jupyter notebook here.

(5.1) Install ARTKIT

ARTKIT can be installed via either PyPI (pip install artkit) or Conda (conda install -c conda-forge artkit). For this project, I am using version 1.0.7.

Because ARTKIT provides out-of-the-box support for popular model providers like OpenAI and Anthropic, there is no need to install those packages separately.

Since we will utilize services from external model providers, it is recommended to store the access keys in .env files and load them with python-dotenv. The steps to do so can be found here.


(5.2) Load Dependencies

We load the necessary dependencies and access keys:

<script src="https://gist.github.com/kennethleungty/edd6cf7fa89df790974ac3f9b1b69ce2.js"></script>

(5.3) Create Class to Access Gandalf

To facilitate interaction with the LLM behind Gandalf, we create a class called GandalfChat to encapsulate the necessary functionality to chat with Gandalf and handle message formatting and response processing.

<script src="https://gist.github.com/kennethleungty/069989c499812eb256ac91b47b0cd437.js"></script>

Let's take a closer look at theGandalfChat class:

  • GandalfChat inherits from ARTKIT's HTTPXChatConnector class. Because Gandalf serves as a custom HTTP endpoint rather than a standalone LLM object, HTTPXChatConnector enables us to establish a seamless connection to it
  • Level enumeration structures the difficulty levels so that they can be referenced using member names like LEVEL_01.
  • build_request_arguments formats the request to include arguments such as the difficulty level and input prompt.
  • parse_httpx_response processes the LLM output based on the HTTP response object returned by the API.
  • get_default_api_key_envprovides the environment variable name where the API key for the chat system might be stored.

In addition to supporting custom systems like Gandalf, ARTKIT also offers pre-built classes for seamless connections to popular LLM platforms like OpenAI and AWS Bedrock.

This integration flexibility is another key strength of ARTKIT, enabling efficient red-teaming against mainstream LLMs.

Details about HTTPXConnector can be found in "Calling custom endpoints via HTTP" __ section of this tutorial.


(5.4) Instantiate Chat Model Instance for Gandalf

We create an instance of GandalfChat as a model object containing the URL of the Gandalf API endpoint and the desired difficulty level. In this example, we will be tackling Level 4.

<script src="https://gist.github.com/kennethleungty/61918557b48cf20c169c90a6cd75d3da.js"></script>

We also utilize ARTKIT's CachedChatModel as a wrapper around GandalfChat to enable the caching of responses into an SQLite database (gandalf_cache.db).

The advantage of storing these chat interactions is that we can minimize redundant API calls for repeated queries, which in turn speeds up response time and reduces costs.

We also set a 10-second cutoff using Timeout to limit the wait time for API responses, ensuring our requests do not hang indefinitely.


(5.5) Setup Attacker LLM

We use OpenAI's GPT-4o model to jailbreak Gandalf, which we instantiate using the OpenAIChat class designed for OpenAI LLMs:

<script src="https://gist.github.com/kennethleungty/d08da1eb4cd90b0f4ee0774e4dd46c58.js"></script>

Just as we did with the Gandalf chat object, we use CachedChatModel to wrap the GPT-4o LLM to enable response caching.


(5.6) Define Attacker LLM Objective and System Prompts

With the attacker LLM ready, we proceed with prompt engineering to define the objective prompt and the attacker system prompt.

Because we will use ARTKIT's multi-turn interaction feature, we need to explicitly specify a separate prompt to describe the attacker LLM's objective (i.e., make Gandalf reveal its password) so that its responses are well-guided.

<script src="https://gist.github.com/kennethleungty/ffc8c2c3c21d7b5047dac1c1ce6c39df.js"></script>

The objective prompt is saved as a dictionary in a list because we can store and use more than one objective for different steps in the pipeline.

Next, we define the system prompt that guides the attacker LLM's strategy in extracting passwords. The prompt is crafted such that the attacker LLM devises indirect and creative techniques to mislead Gandalf about the true intent of the inquiry, thereby bypassing its guardrails.

<script src="https://gist.github.com/kennethleungty/db6c5e1ed55f13795e5ca11cf49e2458.js"></script>

Notice that the system prompt contains the {objective} parameter in which the objective prompt will be dynamically injected.

Moreover, we need to include the following two dynamic parameters for multi-turn interactions to happen:

  • {max_turns}: Maximum turns allowed for the LLM to accomplish the objective, preventing it from engaging in endless conversations.
  • {success_token}: The token to output when the LLM achieves its objective. It serves as a signal to terminate the conversation early.

(5.7) Run Multi-Turn Interactions with Gandalf

We are now one step away from initiating our jailbreaking attempts on Gandalf; all that remains is to link the components and execute them.

ARTKIT's run function allows us to orchestrate the execution of a sequence of steps in a processing pipeline and return the result of the executed flow.

<script src="https://gist.github.com/kennethleungty/5ef5ed05e734fc40d2623153c06f76a4.js"></script>

Here is a look at the parameters of ak.run:

  • The input parameter accepts the objectives variable that contains the objective defined in the preceding step.
  • The stepsparameter accepts a set of steps to execute, where each step is defined with the ak.step function. In this case, we only have one ak.step step to run, which is to conduct multi-turn interactions between the attacker LLM (challenger_llm) and Gandalf (target_llm).
  • In the ak.step function, we use ak.multi_turn to orchestrate multi-turn conversations, thereby maintaining the context and conversation history.
  • We also specify the success token (success_token), the maximum turns (max_turns), and the attacker LLM system prompt (attacker_prompt).

Executing the code above kickstarts the multi-turn interactions aimed at breaking through Gandalf's defenses, with the output saved in results.


(5.8) View Interaction History and Secret Password

After executing the jailbreak, it is time for us to review the outcome. We run the following helper code to structure the conversation history and output more clearly.

<script src="https://gist.github.com/kennethleungty/12922b3936aae1116efac83c6536dabd.js"></script>

And now, the moment of truth! Below is an instance of the interactions between our attacker LLM and Gandalf:

<script src="https://gist.github.com/kennethleungty/2d6cc45540e9fe8e10134e8a77a2f1c0.js"></script>

Through a series of clever prompt techniques (e.g., generating riddles, extracting letters), we successfully extracted the hidden password (i.e., UNDERGROUND) despite Gandalf's best efforts to guard it.

Spoiler: Password for every level can be found here.


(6) Wrap-up

In this article, we demonstrated how ARTKIT can be used for automated prompt-based testing of LLM systems to unveil jailbreaking vulnerabilities. Leveraging LLMs' capabilities for model-based red-teaming offers a powerful means to scale and accelerate the testing of LLM systems.

While we focused on Level 4 in this showcase, the ARTKIT setup was able to smoothly overcome Levels 1 to 6. Beyond that, human intervention became necessary, involving advanced prompt engineering and parameter adjustments.

This highlights the importance of combining automation with human-led red-teaming, where automation saves time by identifying basic vulnerabilities, allowing humans to focus on more complex risks.

The integration of human oversight can be tailored to different sophistication levels, ensuring a balanced and comprehensive testing approach.

Before you go

I welcome you to follow this Medium page and visit my GitHub to stay updated with more engaging and practical content. Meanwhile, have fun red-teaming LLM systems with ARTKIT!

Inside the Leaked System Prompts of GPT-4, Gemini 1.5, and Claude 3

Text-to-Audio Generation with Bark, Clearly Explained

Tags: Generative Ai Tools Hands On Tutorials Llm Machine Learning red-teaming

Comment