Exposing Jailbreak Vulnerabilities in LLM Applications with ARTKIT

As large language models (LLMs) become more widely adopted across different industries and domains, significant security risks have emerged and intensified. Several of these key concerns include breaches of data privacy, the potential for biases, and the risk of information manipulation.
The Open Worldwide Application Security Project recently published the ten most critical security risks of LLM applications, as described below:
Identifying these risks is essential for ensuring that LLM applications continue to provide value in real-world situations while maintaining their safety, effectiveness, and robustness.
In this article, we explore how to use the open-source ARTKIT framework to automatically evaluate security vulnerabilities of LLM applications using the popular Gandalf Challenge **** as an illustrative example.
Contents
(1) About Prompt Injection Vulnerabilities (2) Gandalf Challenge (3) Introducing ARTKIT (4) Approach Overview (5) Step-by-Step Walkthrough
You can find the codes in the accompanying GitHub repo of this article.
(1) About Prompt Injection Vulnerabilities
A prompt injection vulnerability is a type of cyberattack that arises when attackers exploit an LLM with carefully crafted inputs, leading it to execute malicious instructions unintentionally.
Prompt injection attacks can be difficult to detect and can lead to serious consequences such as leakage of sensitive information, unauthorized access, and manipulation of decision-making processes.
It can be performed directly or indirectly:
(i) Direct (Jailbreaking)
- Attackers directly modify underlying system prompts, thereby convincing the LLM system to disregard its safeguards. It allows attackers to generate harmful responses or exploit backend systems by interacting with insecure functions and data stores.
- Example: An attacker crafts prompts instructing the LLM to ignore the instructions in the original system prompts and instead return private information like passwords.
(ii) Indirect
- Attackers manipulate external inputs (e.g., files, websites) that the LLM ingests, allowing them to control the LLM's responses and actions, even if the injected text is invisible to users.
- Example: An attacker uploads a document embedded with prompts—concealed in a zero-point font—instructing LLMs to evaluate the user's resume as that of an exceptional job candidate. When the recruiter uses an LLM to assess the resume, it inadvertently showcases the candidate as highly qualified, thereby skewing the hiring process.
There are different types of prompt injection attacks, such as virtualization, obfuscation, and role-playing attacks. Details can be found here.
(2) Gandalf Challenge
In this project, we attempt to automatically crack the Gandalf Challenge, an interactive game that demonstrates the security vulnerabilities of LLM applications and highlights mitigation strategies.

The objective of the game is simple: use prompt engineering techniques to trick the LLM behind Gandalf's interface to reveal a password.
The game consists of ten progressively difficult levels based on various defenses employed to prevent password disclosure, such as prompts instructing not to reveal the password, input guardrails that filter user prompts, and output guardrails that block responses containing passwords.
(3) Introducing ARTKIT
As LLM systems become more commonplace, it is important to build user trust by ensuring that the models perform reliably even in adversarial conditions. This is where ARTKIT comes in handy in testing LLM systems for their proficiency, equitability, safety, and security.
ARTKIT is an open-source framework for developing powerful automated end-to-end pipelines to test and evaluate LLM-based applications like chatbots and virtual assistants.
Its simplicity and flexibility in building fit-for-purpose pipelines make it an excellent tool for data scientists and engineers to conduct human-in-the-loop testing of LLM systems.
For example, ARTKIT facilitates the effective use of LLMs to automate critical steps in red-teaming, such as generating adversarial prompts to exploit LLMs and analyzing their responses to uncover potential vulnerabilities.
The structured process of simulating attacks on a system to uncover its vulnerabilities and improve security is known as red-teaming. It allows organizations to strengthen their defenses against real-world threats by understanding potential breaches from an attacker's perspective.

One of ARTKIT's standout features is its support for automated multi-turn conversations between an attacker system and a target system, which we will explore in this article.
Given that LLM systems may struggle to maintain context and coherence over prolonged chats, the ability to scale the testing of extended, multi-turn interactions is crucial in identifying potential vulnerabilities.
(4) Approach Overview
Here is an overview of our approach to demonstrating LLM jailbreaking:
- Conduct model-based red-teaming, where we use an LLM model to attack a target system (i.e., Gandalf)
- Utilize ARTKIT and OpenAI's GPT-4o to create an attacker LLM that employs password extraction techniques in its adversarial prompts while engaging in multi-turn conversations until the password is divulged.
OpenAI API keys can be found on the API key page.
(5) Step-by-Step Walkthrough
Let's review the steps for using ARTKIT to extract passwords in the Gandalf challenge. You can find the accompanying Jupyter notebook here.
(5.1) Install ARTKIT
ARTKIT can be installed via either PyPI (pip install artkit
) or Conda (conda install -c conda-forge artkit
). For this project, I am using version 1.0.7.
Because ARTKIT provides out-of-the-box support for popular model providers like OpenAI and Anthropic, there is no need to install those packages separately.
Since we will utilize services from external model providers, it is recommended to store the access keys in .env
files and load them with python-dotenv
. The steps to do so can be found here.
(5.2) Load Dependencies
We load the necessary dependencies and access keys:
(5.3) Create Class to Access Gandalf
To facilitate interaction with the LLM behind Gandalf, we create a class called GandalfChat
to encapsulate the necessary functionality to chat with Gandalf and handle message formatting and response processing.
Let's take a closer look at theGandalfChat
class:
GandalfChat
inherits from ARTKIT'sHTTPXChatConnector
class. Because Gandalf serves as a custom HTTP endpoint rather than a standalone LLM object,HTTPXChatConnector
enables us to establish a seamless connection to itLevel
enumeration structures the difficulty levels so that they can be referenced using member names likeLEVEL_01
.build_request_arguments
formats the request to include arguments such as the difficulty level and input prompt.parse_httpx_response
processes the LLM output based on the HTTP response object returned by the API.get_default_api_key_env
provides the environment variable name where the API key for the chat system might be stored.
In addition to supporting custom systems like Gandalf, ARTKIT also offers pre-built classes for seamless connections to popular LLM platforms like OpenAI and AWS Bedrock.
This integration flexibility is another key strength of ARTKIT, enabling efficient red-teaming against mainstream LLMs.
Details about
HTTPXConnector
can be found in "Calling custom endpoints via HTTP" __ section of this tutorial.
(5.4) Instantiate Chat Model Instance for Gandalf
We create an instance of GandalfChat
as a model object containing the URL of the Gandalf API endpoint and the desired difficulty level. In this example, we will be tackling Level 4.
We also utilize ARTKIT's CachedChatModel
as a wrapper around GandalfChat
to enable the caching of responses into an SQLite database (gandalf_cache.db
).
The advantage of storing these chat interactions is that we can minimize redundant API calls for repeated queries, which in turn speeds up response time and reduces costs.
We also set a 10-second cutoff using Timeout
to limit the wait time for API responses, ensuring our requests do not hang indefinitely.
(5.5) Setup Attacker LLM
We use OpenAI's GPT-4o model to jailbreak Gandalf, which we instantiate using the OpenAIChat
class designed for OpenAI LLMs:
Just as we did with the Gandalf chat object, we use CachedChatModel
to wrap the GPT-4o LLM to enable response caching.
(5.6) Define Attacker LLM Objective and System Prompts
With the attacker LLM ready, we proceed with prompt engineering to define the objective prompt and the attacker system prompt.
Because we will use ARTKIT's multi-turn interaction feature, we need to explicitly specify a separate prompt to describe the attacker LLM's objective (i.e., make Gandalf reveal its password) so that its responses are well-guided.
The objective prompt is saved as a dictionary in a list because we can store and use more than one objective for different steps in the pipeline.
Next, we define the system prompt that guides the attacker LLM's strategy in extracting passwords. The prompt is crafted such that the attacker LLM devises indirect and creative techniques to mislead Gandalf about the true intent of the inquiry, thereby bypassing its guardrails.
Notice that the system prompt contains the {objective}
parameter in which the objective prompt will be dynamically injected.
Moreover, we need to include the following two dynamic parameters for multi-turn interactions to happen:
{max_turns}
: Maximum turns allowed for the LLM to accomplish the objective, preventing it from engaging in endless conversations.{success_token}
: The token to output when the LLM achieves its objective. It serves as a signal to terminate the conversation early.
(5.7) Run Multi-Turn Interactions with Gandalf
We are now one step away from initiating our jailbreaking attempts on Gandalf; all that remains is to link the components and execute them.
ARTKIT's run
function allows us to orchestrate the execution of a sequence of steps in a processing pipeline and return the result of the executed flow.
Here is a look at the parameters of ak.run
:
- The
input
parameter accepts theobjectives
variable that contains the objective defined in the preceding step. - The
steps
parameter accepts a set of steps to execute, where each step is defined with theak.step
function. In this case, we only have oneak.step
step to run, which is to conduct multi-turn interactions between the attacker LLM (challenger_llm
) and Gandalf (target_llm
). - In the
ak.step
function, we useak.multi_turn
to orchestrate multi-turn conversations, thereby maintaining the context and conversation history. - We also specify the success token (
success_token
), the maximum turns (max_turns
), and the attacker LLM system prompt (attacker_prompt
).
Executing the code above kickstarts the multi-turn interactions aimed at breaking through Gandalf's defenses, with the output saved in results
.
(5.8) View Interaction History and Secret Password
After executing the jailbreak, it is time for us to review the outcome. We run the following helper code to structure the conversation history and output more clearly.
And now, the moment of truth! Below is an instance of the interactions between our attacker LLM and Gandalf:
Through a series of clever prompt techniques (e.g., generating riddles, extracting letters), we successfully extracted the hidden password (i.e., UNDERGROUND) despite Gandalf's best efforts to guard it.
Spoiler: Password for every level can be found here.
(6) Wrap-up
In this article, we demonstrated how ARTKIT can be used for automated prompt-based testing of LLM systems to unveil jailbreaking vulnerabilities. Leveraging LLMs' capabilities for model-based red-teaming offers a powerful means to scale and accelerate the testing of LLM systems.
While we focused on Level 4 in this showcase, the ARTKIT setup was able to smoothly overcome Levels 1 to 6. Beyond that, human intervention became necessary, involving advanced prompt engineering and parameter adjustments.
This highlights the importance of combining automation with human-led red-teaming, where automation saves time by identifying basic vulnerabilities, allowing humans to focus on more complex risks.
The integration of human oversight can be tailored to different sophistication levels, ensuring a balanced and comprehensive testing approach.
Before you go
I welcome you to follow this Medium page and visit my GitHub to stay updated with more engaging and practical content. Meanwhile, have fun red-teaming LLM systems with ARTKIT!
Inside the Leaked System Prompts of GPT-4, Gemini 1.5, and Claude 3