When AutoML Meets Large Language Model


Automated machine learning, or AutoML for short, aims to automate various steps of a machine learning pipeline. A crucial aspect of it is hyperparameter tuning. Hyperparameters are parameters that govern the structure and behavior of an ML algorithm (e.g., the number of layers in a neural network), and their values are set beforehand rather than learned from the data, unlike other ML parameters (e.g., the weights and biases of a neural network layer).

The current practices used in AutoML for hyperparameter tuning rely heavily on sophisticated algorithms (such as Bayesian optimization) to automatically identify the optimal combination of the hyperparameters that yields the best model performance.

While approaching the hyperparameter tuning problems from a pure algorithmic point of view can work, there is another crucial piece of information that could complement the algorithmic search strategy: human expertise.

Senior data scientists, based on their years of experience and deep understanding of the ML algorithms, often intuitively know where to initiate the search, which regions of the search space might be more promising, or when to narrow down or expand the possibilities.

Therefore, this gives us a very interesting idea: can we design a scalable expertise-guided search strategy that leverages both the nuanced insights provided by the expert and the search efficiency offered by the AutoML algorithms?

This is where large language models (LLMs), such as GPT-4, can play a role. Among their vast amount of training data, a certain portion of the corpus is dedicated to explaining and discussing the best practices of machine learning (ML). Thanks to that, LLMs have internalized a substantial amount of ML expertise and absorbed a significant chunk of the collective ML wisdom. This positions LLMs as potentially knowledgeable ML experts who can interact with an existing AutoML tool to collaboratively perform expert-guided hyperparameter tuning.

An illustration of LLM-guided hyperparameter tuning. (Image by author)

In this blog post, let's take this idea for a spin and implement an LLM-guided hyperparameter tuning workflow. Specifically, we will develop a workflow that combines LLM guidance with a simple random search, and then compare its tuning results against an off-the-shelf, state-of-the-art, algorithmic-based AutoML tool called FLAML (from Microsoft Research). The rationale for this comparison is to assess if tapping into ML domain expertise can truly bring value, even when paired with a straightforward search algorithm.

Does this idea resonate with you? Let's get started!

[NOTE]: All the prompts shown in this blog are generated and optimized by ChatGPT (GPT-4). This is both necessary, as it ensures the quality of the prompts, and beneficial, as it saves us from tedious manual prompt engineering.

This is the 4th blog in my series of LLM projects. The 1st one is Building an AI-Powered Language Learning App, the 2nd one is Developing an Autonomous Dual-Chatbot System for Research Paper Digesting, and the 3rd one is Training Problem-Solving Skills in Data Science with Real-Life Simulations. Feel free to check them out!

Table of Contents

· 1. Case Study
  ∘ 1.1 Dataset description
  ∘ 1.2 Model description
· 2. Workflow Design
· 3. Configuring the Chatbot
· 4. Suggesting Optimization Metric
· 5. Defining Initial Search Space
· 6. Refining Search Space
· 7. Tuning Log Analysis
  ∘ 7.1 Random search with successive halving
  ∘ 7.2 Log analysis
· 8. Case Study
  ∘ 8.1 Determining metric
  ∘ 8.2 1st search iteration
  ∘ 8.3 2nd search iteration
  ∘ 8.4 3rd search iteration
  ∘ 8.5 Testing
· 9. Comparison Against Out-of-Box AutoML Tool
· 10. Conclusion


1. Case Study

To ground our discussion in a concrete example, let's start by introducing the case study we will investigate.

In this blog, we will be looking at a hyperparameter tuning task for a binary classification problem. More specifically, we will investigate a cybersecurity dataset called the NSL-KDD dataset and identify the optimal hyperparameters of an XGBoost model such that the trained model can accurately distinguish benign and attack activities.

1.1 Dataset description

The NSL-KDD dataset is a widely used dataset in the field of network intrusion detection. The full dataset contains four attack categories, i.e., DoS (Denial of Service), R2L (unauthorized access from a remote machine), U2R (privilege escalation attempts), and Probe (surveillance and probing attacks, such as port scanning). For our current case study, we investigate a binary classification problem: we only consider data samples that are either "benign" or "probe" in nature and train a classifier that can distinguish between those two states.

The NSL-KDD dataset consists of 40 features (e.g., connection length, protocol type, transmitted data bytes, etc.) that are derived from raw network traffic data, and capture various characteristics of individual network connections. The dataset has been pre-divided into a train and a test set. The following table shows the number of samples in both sets:

Number of samples in train and test set. (Image by author)

Notice that our current dataset is imbalanced. This is typical in cybersecurity applications as the available number of attack samples is usually much smaller than the number of benign samples.

You can find the pre-processed datasets here.
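
As a quick sanity check, the class imbalance can be verified directly from the label column. Below is a minimal sketch, assuming the preprocessed splits are saved as CSV files (the file names are placeholders) with the binary label in the last column:

import pandas as pd

# Load the preprocessed splits (file names are placeholders)
df_train = pd.read_csv("NSL_KDD_binary_train.csv")
df_test = pd.read_csv("NSL_KDD_binary_test.csv")

# Inspect the class distribution of the training set (label in the last column)
print(df_train.iloc[:, -1].value_counts(normalize=True))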

1.2 Model description

Although full AutoML also entails model selection, in the current case study we limit our scope to optimizing the XGBoost model only. XGBoost is known for its strong, versatile performance, provided that the right hyperparameters are used. Unfortunately, given the large number of tunable hyperparameters in XGBoost, it is non-trivial to identify the optimal hyperparameter combination for a given dataset. This is exactly where AutoML can bring value.

We consider tuning the following XGBoost hyperparameters:

n_estimators, max_depth, min_child_weight, gamma, scale_pos_weight, learning_rate, subsample, colsample_bylevel, colsample_bytree, reg_alpha, and reg_lambda.

For detailed descriptions of the above-mentioned hyperparameters, please refer to the official documentation.
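
To make the list concrete, the snippet below instantiates an XGBoost classifier with one arbitrary combination of these hyperparameters; the values are purely illustrative, and finding good ones is exactly what the tuning workflow is for:

import xgboost as xgb

# One arbitrary (illustrative) hyperparameter combination
clf = xgb.XGBClassifier(
    n_estimators=200, max_depth=6, min_child_weight=1, gamma=0.0,
    scale_pos_weight=1.0, learning_rate=0.1, subsample=0.8,
    colsample_bylevel=0.8, colsample_bytree=0.8,
    reg_alpha=0.0, reg_lambda=1.0,
    objective='binary:logistic', eval_metric='logloss', seed=42
)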


2. Workflow Design

To realize LLM-guided hyperparameter tuning, there are two questions we need to answer: how should we incorporate the LLM's ML expertise into the tuning process, and how should we let the LLM interact with the tuning tool?

1️⃣ How to incorporate the LLM's ML expertise into the tuning process?

Well, there are at least three places in the tuning process where LLM's ML expertise can provide guidance:

  • Suggest optimization metric: Hyperparameter tuning usually requires a metric to define what is deemed as "optimal" among various competing hyperparameter combinations. This sets the target for the underlying optimization algorithms (e.g., Bayesian optimization) in AutoML. LLMs can provide insights into which metrics are more suitable for specific types of problems and dataset characteristics and potentially offer explanations of the advantages and disadvantages of candidate metrics.
  • Suggest initial search space: As most hyperparameter tuning tasks are conducted in an iterative manner, it is usually necessary to configure an initial search space to set the stage for the optimization process. LLMs, based on their learned ML best practices, can recommend hyperparameter ranges that are meaningful to the investigated dataset characteristics and the selected ML model. This can potentially reduce the need for unnecessary explorations, thus saving a significant amount of computation time.
  • Suggest refinement for search space: As the tuning process progresses, it is usually necessary to refine the configured search space. Refinement can take two directions, i.e., narrowing down the ranges of certain hyperparameters to regions that show promise, or expanding the ranges of certain hyperparameters to explore new areas. LLMs, by analyzing the optimization logs from the previous rounds, can automatically propose new refinements that drive the tuning process to more promising results.

2️⃣ How to let LLM interact with the tuning tool?

Ideally, we could wrap the random search tool as an API and implement an LLM-based agent that can access this API to perform tuning. However, given the limited time, I didn't manage to configure a working agent that could reliably perform the iterative tuning process outlined above: sometimes the agent couldn't properly use the tool due to an incorrect input schema; other times the agent simply diverged from the task.

Alternatively, I implemented a simple chatbot-based workflow, which consists of the following two components:

  • A chatbot with memory. Here, memory is important because the chatbot needs to recall the previously suggested search space.
  • Three prompts, corresponding to suggesting the optimization metric, suggesting the initial search space, and suggesting search-space refinements, respectively.

To kick-start the workflow, the chatbot is first prompted to suggest a suitable optimization metric based on the problem context and the dataset characteristics. The chatbot's response is then parsed, and the name of the metric is extracted and stored as a variable. As a second step, the chatbot is prompted to suggest an initial search space. As in the first step, the response is parsed and the search space is extracted and stored as a variable. With both pieces of information available, the random search tool is invoked with the chatbot-suggested metric and search space. Overall, this constitutes the first search iteration.

Once the random search tool completes the search, the chatbot is prompted to recommend refinements of the search space based on the results obtained from the last run. After a new search space is successfully parsed from the chatbot's response, another round of random search proceeds. This process iterates until either the computational budget is exhausted or convergence is reached, e.g., no further improvement over the previous run.

An illustration of a chatbot-based solution for achieving LLM-guided hyperparameter tuning. (Image by author)
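
To make the flow concrete, here is a minimal sketch of the loop. The prompt functions and logs_analysis() are developed in the following sections; parse_metric(), parse_search_space(), and run_random_search() are hypothetical names for the glue code around the chatbot and the search tool:

# Step 1: ask the chatbot for an optimization metric and an initial search space
metric = parse_metric(conversation.predict(input=suggest_metrics(report)))
search_space = parse_search_space(conversation.predict(input=suggest_initial_search_space()))

# Step 2: iterate - random search, log analysis, LLM-suggested refinement
all_time_best_score = float("-inf")
for i in range(n_rounds):  # n_rounds: iteration budget (e.g., 3)
    # run_random_search is a hypothetical wrapper around HalvingRandomSearchCV
    search = run_random_search(search_space, metric)
    top_n, last_run_best_score = logs_analysis(search.cv_results_, N=5)
    all_time_best_score = max(all_time_best_score, last_run_best_score)

    prompt = suggest_refine_search_space(top_n, last_run_best_score, all_time_best_score)
    search_space = parse_search_space(conversation.predict(input=prompt))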

Although not as elegant as the agent-based approach, this chatbot-based workflow bears a couple of benefits:

  • Easy to implement and to quality-control. The integration of a chatbot with the tuning tool boils down to designing three prompts and a couple of helper functions that extract the target information from the chatbot's responses. This is much simpler than a fully integrated agent-based approach. Quality control also becomes more manageable, as the hyperparameter tuning task is divided into explicit steps, where each step can be monitored and adjusted, ensuring that the search process remains on track.
  • Fully transparent decision-making. As the chatbot will clearly articulate each decision or recommendation it has made, the hyperparameter tuning process is no longer a black-box process for the user. This is crucial for achieving interpretable and trustworthy AutoML.
  • Allow incorporation of human intuition. Although the current workflow is designed to be autonomous, it can be trivially extended to allow the human expert to either choose to accept the advice or make necessary adjustments based on their own expertise, before each search iteration. This flexibility opens the door for human-in-the-loop tuning and potentially leads to better optimization results.

In the next sections, we will go through the details of configuring the chatbot, as well as the prompts for suggesting the metric, initial search space, and search space refinement, individually.


3. Configuring the Chatbot

We can easily set up a chatbot with memory using LangChain. We start by importing the necessary libraries:

from langchain.prompts import (
    ChatPromptTemplate, 
    MessagesPlaceholder, 
    SystemMessagePromptTemplate, 
    HumanMessagePromptTemplate
)
from langchain.chains import ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory

In LangChain, a chatbot can be configured via a ConversationChain object, which requires a backbone LLM, a memory object to hold the conversation history, as well as a prompt template that specifies the chatbot's behavior:

# Set up LLM
llm = ChatOpenAI(temperature=0.8)

# Set up memory
memory = ConversationBufferMemory(return_messages=True)

# Set up the prompt
prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_message),
    MessagesPlaceholder(variable_name="history"),
    HumanMessagePromptTemplate.from_template("""{input}""")
])

# Create conversation chain
conversation = ConversationChain(memory=memory, prompt=prompt, 
                                llm=llm, verbose=False)

In the code snippet above, the variable system_message sets the context for the chatbot, which is shown below:

# Note: the prompt is generated and optimized by ChatGPT (GPT-4)
system_message = f"""
You are a senior data scientist tasked with guiding the use of an AutoML 
tool to discover the best XGBoost model configurations for a given binary 
classification dataset. Your role involves understanding the dataset 
characteristics, proposing suitable metrics, hyperparameters, and their 
search spaces, analyzing results, and iterating on configurations. 
"""

Note that we have specified the role, the purpose, and the expected behavior of the chatbot in the system message. Later, we can make inferences with the configured chatbot using:

# Invoke chatbot with the input prompt
response = conversation.predict(input=prompt)

where prompt contains the specific instructions we have for the chatbot, i.e., suggesting the optimization metric, suggesting the initial search space, or suggesting refinements to the search space. We will cover those prompts in the next couple of sections.


4. Suggesting Optimization Metric

As the first step of LLM-guided hyperparameter tuning, we would like the chatbot to propose a suitable metric for the tuning tool to optimize. This decision should be based on several factors, including the nature of the problem (i.e., regression or classification), the characteristics of the given dataset, and any other specific requirements from business or real-world implications.

The following snippet defines the prompt to achieve our goal:

def suggest_metrics(report):

    # Note: The prompt is generated and optimized by ChatGPT (GPT-4)
    prompt = f"""
    The classification problem under investigation is based on a network 
    intrusion detection dataset. This dataset contains Probe attack type, 
    which are all grouped under the "attack" class (label: 1). Conversely, 
    the "normal" class is represented by label 0. Below are the dataset's 
    characteristics:
    {report}.

    For this specific inquiry, you are tasked with recommending a suitable 
    hyperparameter optimization metric for training a XGBoost model. It is 
    crucial that the model should accurately identify genuine threats (attacks) 
    without raising excessive false alarms on benign activities. They are equally 
    important. Given the problem context and dataset characteristics, suggest 
    only the name of one of the built-in metrics: 
    - 'accuracy'
    - 'roc_auc' (ROCAUC score)
    - 'f1' (F1 score)
    - 'balanced_accuracy' (It is the macro-average of recall scores per class 
    or, equivalently, raw accuracy where each sample is weighted according to 
    the inverse prevalence of its true class) 
    - 'average_precision'
    - 'precision'
    - 'recall'
    - 'neg_brier_score'

    Please first briefly explain your reasoning and then provide the 
    recommended metric name. Your recommendation should be enclosed between 
    markers [BEGIN] and [END], with standalone string for indicating the 
    metric name.
    Do not provide other settings or configurations.
    """

    return prompt

In the prompt shown above, we have informed the chatbot of the context of the problem, the dataset characteristics (encapsulated in the variable report, which we will discuss later), the objective (recommending a suitable hyperparameter optimization metric for training an XGBoost model), the specific requirements (high detection rate and low false alarm rate), and the candidate metrics (all of which are supported in scikit-learn). Note that we explicitly asked the chatbot to output the metric name inside a [BEGIN]-[END] block, which eases the automatic extraction of the information.
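
For completeness, one possible parsing helper is sketched below (the original workflow code does not show the parsing step, so treat this as an assumption): it extracts whatever sits between the [BEGIN] and [END] markers with a regular expression.

import re

def parse_metric(response):
    """Extract the metric name enclosed between [BEGIN] and [END] markers."""
    match = re.search(r"\[BEGIN\](.*?)\[END\]", response, re.DOTALL)
    if match is None:
        raise ValueError("No [BEGIN]...[END] block found in the chatbot response.")
    return match.group(1).strip().strip("'\"")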

To generate the data report, we can define the following function:

def data_report(df, num_feats, bin_feats, nom_feats):
    """
    Generate data characteristics report.

    Inputs:
    -------
    df: dataframe for the dataset.
    num_feats: list of names of numerical features.
    bin_feats: list of names of binary features.
    nom_feats: list of names of nominal features.

    Outputs:
    --------
    report: data characteristics report.
    """

    # Label column 
    target = df.iloc[:, -1]
    features = df.iloc[:, :-1]

    # General dataset info
    num_instances = len(df)
    num_features = features.shape[1]

    # Class imbalance analysis
    class_counts = target.value_counts()
    class_distribution = class_counts/num_instances

    # Create report
    # Note: The format of the report is generated and optimized
    # by ChatGPT (GPT-4)
    report = f"""Data Characteristics Report:

- General information:
  - Number of Instances: {num_instances}
  - Number of Features: {num_features}

- Class distribution analysis:
  - Class Distribution: {class_distribution.to_string()}

- Feature analysis:
  - Feature names: {features.columns.to_list()}
  - Number of numerical features: {len(num_feats)}
  - Number of binary features: {len(bin_feats)}
  - Binary feature names: {bin_feats}
  - Number of nominal features: {len(nom_feats)}
  - Nominal feature names: {nom_feats}
"""

    return report

Here, we specifically calculate whether the given dataset is imbalanced, as this information could impact the selection of the optimization metric.
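
As a usage sketch (the three feature-name lists come from our own preprocessing step and are assumed here), generating the report for the training set looks like this:

# df_train: preprocessed NSL-KDD training dataframe (label in the last column)
# num_feats / bin_feats / nom_feats: feature-name lists from our preprocessing step
report = data_report(df_train, num_feats, bin_feats, nom_feats)
print(report)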

Now we have completed the first prompt for instructing the chatbot to suggest a suitable metric for evaluating the performance of various hyperparameter configurations. Next, let's look at constructing the prompt for defining the initial search space.


5. Defining Initial Search Space

Besides the optimization metric, we still need an initial search space of the hyperparameters to start the tuning round. The following snippet shows the prompt for achieving that goal:

def suggest_initial_search_space():

    # Note: The prompt is generated and optimized by ChatGPT (GPT-4) 
    prompt = f"""
    Given your understanding of XGBoost and general best practices in machine 
    learning, suggest an initial search space for hyperparameters. 

    Tunable hyperparameters include:
    - n_estimators (integer): Number of boosting rounds or trees to be trained.
    - max_depth (integer): Maximum tree depth for base learners.
    - min_child_weight (integer or float): Minimum sum of instance weight 
    (hessian) needed in a leaf node. 
    - gamma (float): Minimum loss reduction required to make a further 
    partition on a leaf node of the tree.
    - scale_pos_weight (float): Balancing of positive and negative weights.
    - learning_rate (float): Step size shrinkage used during each boosting 
    round to prevent overfitting. 
    - subsample (float): Fraction of the training data sampled to train each 
    tree. 
    - colsample_bylevel (float): Fraction of features that can be randomly 
    sampled for building each level (or depth) of the tree.
    - colsample_bytree (float): Fraction of features that can be randomly 
    sampled for building each tree. 
    - reg_alpha (float): L1 regularization term on weights. 
    - reg_lambda (float): L2 regularization term on weights. 

    The search space is defined as a dict with keys being hyperparameter names, 
    and values are the search space associated with the hyperparameter. 
    For example:
        search_space = {{
            "learning_rate": loguniform(1e-4, 1e-3)
        }}

    Available types of domains include: 
    - scipy.stats.uniform(loc, scale), it samples values uniformly between 
    loc and loc + scale.
    - scipy.stats.loguniform(a, b), it samples values between a and b in a 
    logarithmic scale.
    - scipy.stats.randint(low, high), it samples integers uniformly between 
    low (inclusive) and high (exclusive).
    - a list of possible discrete value, e.g., ["a", "b", "c"]

    Please first briefly explain your reasoning, then provide the 
    configurations of the initial search space. Enclose your suggested 
    configurations between markers [BEGIN] and [END], and assign your 
    configuration to a variable named search_space.
    """

    return prompt

The prompt above contains quite a lot of information, so let's break it down:

  • We started by informing the chatbot about our objective.
  • We provided a list of tunable hyperparameters and their meanings. Additionally, we also indicate the expected data type of each hyperparameter. This piece of information is important for the LLM to decide the sampling distribution.
  • We defined the expected output format for the search space for effective parsing.
  • We indicated available sampling distributions the LLM can suggest.

Note that we explicitly asked the chatbot to briefly explain the rationale behind the suggested search space. This is crucial for achieving transparency and interpretability.

That's it for recommending the initial search space. Next, we look at refining the search space.


6. Refining Search Space

After obtaining the optimization metric and initial search space from the chatbot, we can kick-start a tuning round. Afterward, we can feed the generated AutoML logs into the chatbot and prompt it to suggest refinements to the search space.

The following snippet shows the prompt to achieve the goal:

def suggest_refine_search_space(top_n, last_run_best_score, all_time_best_score):
    """
    Generate prompt for refining the search space.

    Inputs:
    -------
    top_n: string representation of the top-5 best-performing configurations.
    last_run_best_score: best test score from the last run.
    all_time_best_score: best test score from all previous runs.

    Outputs:
    --------
    prompt: generated prompt.
    """

    # Note: The prompt is generated and optimized by ChatGPT (GPT-4)
    prompt = f"""
    Given your previously suggested search space, the obtained top configurations 
    with their test scores:
    {top_n}

    The best score from the last run was {last_run_best_score}, while the best 
    score ever achieved in all previous runs is {all_time_best_score}

    Remember, tunable hyperparameters are: n_estimators, max_depth, min_child_weight, 
    gamma, scale_pos_weight, learning_rate, subsample, colsample_bylevel, 
    colsample_bytree, reg_alpha, and reg_lambda.

    Given the insights from the search history, your expertise in ML, and the 
    need to further explore the search space, please suggest refinements for 
    the search space in the next optimization round. Consider both narrowing 
    and expanding the search space for hyperparameters where appropriate.

    For each recommendation, please:
    1. Explicitly tie back to any general best practices or patterns you are 
    aware of regarding XGBoost tuning
    2. Then, relate to the insights from the search history and explain how 
    they align or deviate from these practices or patterns.
    3. If suggesting an expansion of the search space, please provide a 
    rationale for why a broader range could be beneficial.

    Briefly summarize your reasoning for the refinements and then present the 
    adjusted configurations. Enclose your refined configurations between 
    markers [BEGIN] and [END], and assign your configuration to a variable 
    named search_space.
    """

    return prompt

There are a couple of things worth explaining:

  • We supply the chatbot with the top-5 best-performing configurations as well as their associated test scores from the last tuning round. This could serve as the base for the chatbot to determine the next round of refinement.
  • By including the best test scores of the last run and all previous runs, the chatbot could know if the suggested search space from the previous run is effective or not.
  • We explicitly ask the chatbot to consider further exploring the search space. Generally, the best practice in optimization is to balance exploration and exploitation. Since our employed random search algorithm (which will be discussed in the results section) already entailed the exploitation, it makes sense that we instruct the LLM to focus more on exploration.
  • As with the other prompts, we asked the LLM to reason about its suggestions for transparency and interpretability.

In the next section, let's take a look at the log analysis logic that produces the top_n variable.


7. Tuning Log Analysis

As an iterative approach, we would like the chatbot to propose a new search space once the previous search run has been completed, and the basis for that decision should be the log produced during the last search run. In this section, we first introduce the search algorithm employed in the current case study and then discuss the log structure and the code to extract useful insights from it.

7.1 Random search with successive halving

As mentioned at the beginning of this post, we would like to couple a simple hyperparameter search algorithm with the LLM to examine if domain expertise can bring value.

One of the simplest tuning approaches is random search: for a defined search space (specified by attaching sampling distributions to the hyperparameters), a given number of hyperparameter combinations are sampled and their associated model performances are evaluated. The sampled configuration that yields the best performance is then selected.

Despite its simplicity, the naive version of random search may lead to inefficient use of computational resources, as it does not discriminate between hyperparameter configurations: poor hyperparameter choices are trained to completion anyway.

To address this issue, the successive halving technique has been proposed to enhance the basic random search strategy. Essentially, a successive halving strategy first evaluates many configurations with a small amount of resources, and then gradually allocates more resources to only the promising ones. As a result, poor configurations can be effectively discarded early on, thereby enhancing search efficiency. Scikit-learn provides the [HalvingRandomSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingRandomSearchCV.html) estimator, which implements exactly this strategy and which we will adopt in our current case study.

7.2 Log analysis

When running the [HalvingRandomSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingRandomSearchCV.html) search, the search logs are stored in the attribute cv_results_. The raw logs are in a dictionary format, which can be converted to a pandas DataFrame for easier analysis:

Logs produced by the search algorithm. (Examples taken from the official user guide.)

For our current purposes, we would like to extract the top-N (default 5) best-performing configurations as well as their associated test scores. The following function shows how we can achieve that:

import pandas as pd

def logs_analysis(results, N):
    """
    Extracting the top performing configs from the logs.

    Inputs:
    -------
    results: results dict produced by the sklearn search object.
    N: the number of top performing configs to consider.

    Outputs:
    --------
    top_config_summary: a string summary of the top-N best performing configs 
    and their associated test scores.
    last_run_best_score: the best test score obtained in the current search run.
    """

    # Convert to Dataframe
    df = pd.DataFrame(results)

    # Rank configs' performance
    top_configs = df.nsmallest(N, 'rank_test_score').reset_index(drop=True)

    # Considered hyperparameters
    hyperparameter_columns = [
        'param_colsample_bylevel', 'param_colsample_bytree', 'param_gamma',
        'param_learning_rate', 'param_max_depth', 'param_min_child_weight',
        'param_n_estimators', 'param_reg_alpha', 'param_reg_lambda',
        'param_scale_pos_weight', 'param_subsample'
    ]

    # Convert to string
    config_strings = []
    for i, row in top_configs.iterrows():
        config = ', '.join([f"{col[6:]}: {row[col]}" for col in hyperparameter_columns])
        config_strings.append(f"Configuration {i + 1} ({row['mean_test_score']:.4f} test score): {config}")

    top_config_summary = '\n'.join(config_strings)

    # Best test score
    last_run_best_score = top_configs.loc[0, 'mean_test_score']

    return top_config_summary, last_run_best_score
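
A typical usage sketch after each tuning round, including the bookkeeping of the all-time best score (which the original code keeps implicit, so the exact variable handling is an assumption), might look like this:

# Initialize once, before the first tuning round
all_time_best_score = float("-inf")

# After each HalvingRandomSearchCV run stored in `search`
top_n, last_run_best_score = logs_analysis(search.cv_results_, N=5)
all_time_best_score = max(all_time_best_score, last_run_best_score)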

8. Case Study

Now that we have all the pieces in place, it's time to apply the LLM-guided workflow to our case study.

8.1 Determining metric

To begin with, we prompt the LLM to suggest a suitable optimization metric:

# Suggest metrics
prompt = suggest_metrics(report)
response = conversation.predict(input=prompt)
print(response)

The following is the response produced by the LLM:

LLM suggested a suitable metric and provided the reasoning. (Image by author)

We can see that the LLM recommended the "F1" score as the metric based on the problem context and dataset characteristics, which aligns with our expectations. In addition, the metric name is correctly enclosed between our specified markers for postprocessing.

8.2 1st search iteration

Next, we prompt the LLM to suggest an initial search space:

# Initial search space
prompt = suggest_initial_search_space()
response = conversation.predict(input=prompt)
print(response)

The output of LLM is shown below:

LLM suggested an initial search space and provided the reasoning. (Image by author)

We can see that the LLM has faithfully followed our instructions and provided detailed explanations for the chosen range of each tunable hyperparameter. This is crucial for ensuring the transparency of the hyperparameter tuning process. Also, note that the LLM has managed to output a search space in the correct format, which shows the importance of giving concrete examples in the prompt design.
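
For reference, a possible parsing helper is sketched below (our own assumption; the original post does not show the parsing code). It extracts the [BEGIN]/[END] block and evaluates the search_space assignment, with the scipy.stats distributions made available to it:

import re
import scipy.stats
from scipy.stats import uniform, loguniform, randint

def parse_search_space(response):
    """Extract and evaluate the search_space assignment from the [BEGIN]/[END] block."""
    block = re.search(r"\[BEGIN\](.*?)\[END\]", response, re.DOTALL).group(1)
    # Drop a possible markdown code fence around the assignment
    block = block.replace("```python", "").replace("```", "").strip()
    namespace = {"scipy": scipy, "uniform": uniform,
                 "loguniform": loguniform, "randint": randint}
    # Note: exec-ing LLM output is convenient for a prototype but should be
    # reviewed or sandboxed in any real setting
    exec(block, namespace)
    return namespace["search_space"]

search_space = parse_search_space(response)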

Then, we invoke the random search with successive halving and run for the first iteration:

from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
import xgboost as xgb

clf = xgb.XGBClassifier(seed=42, objective='binary:logistic', 
                        eval_metric='logloss', n_jobs=-1, 
                        use_label_encoder=False)
search = HalvingRandomSearchCV(clf, search_space, scoring='f1', 
                               n_candidates=500, cv=5, 
                               min_resources='exhaust', factor=3, 
                               verbose=1).fit(X_train, y_train)

Note that the HalvingRandomSearchCV estimator is still experimental. Therefore, it is necessary to first explicitly import enable_halving_search_cv before using the estimator.

Using the HalvingRandomSearchCV estimator requires setting a couple of parameters. In addition to specifying the estimator (clf), the parameter distributions (the search space), and the optimization metric (f1), we also need to specify the parameters that govern the time budget. For our current case, we set n_candidates to 500, which means that 500 candidate configurations (each with a different hyperparameter combination) are sampled at the first iteration. Also, factor is set to 3, meaning that only one-third of the candidates are selected for each subsequent iteration. Meanwhile, those selected candidates use 3 times more resources (i.e., training samples) in the subsequent iteration. Finally, we set min_resources to 'exhaust', meaning that at the last iteration, the remaining candidates use all of the available training samples. For a detailed description of setting up the HalvingRandomSearchCV estimator, please refer to this post: 11 Times Faster Hyperparameter Tuning with HalvingGridSearch.

A snapshot of the logs produced by running the HalvingRandomSearchCV estimator is shown below. The wall time for running the search is around 20 minutes on my PC.

Snapshot of the logs produced by the successive halving random search algorithm. (Image by author)

8.3 2nd search iteration

Once the search is completed, we can retrieve the search history stored in search.cv_results_ and send it to the previously defined logs_analysis() function to extract tuning insights. Afterward, we call the suggest_refine_search_space() function to prompt the LLM to recommend a new search space based on the previous search results:

# Configure prompt
prompt = suggest_refine_search_space(top_n, last_run_best_score, all_time_best_score)

# Refine search space
response = conversation.predict(input=prompt)
print(response)

The response of the LLM is shown below:

LLM suggested a refinement of search space and provided the reasoning. (Image by author)

Here, we can see that the LLM has suggested narrower ranges for some hyperparameters and expanded ranges for others. The rationale behind those suggestions is also clearly articulated, which promotes interpretability and transparency.

Given the refined search space, we can run the HalvingRandomSearchCV estimator for the second time:

clf = xgb.XGBClassifier(seed=42, objective='binary:logistic', 
                        eval_metric='logloss', n_jobs=-1, 
                        use_label_encoder=False)
search = HalvingRandomSearchCV(clf, search_space, scoring='f1', 
                               n_candidates=500, cv=5, 
                               min_resources='exhaust', factor=3, 
                               verbose=1).fit(X_train, y_train)

The wall time for running the search is around 29 minutes on my PC.

8.4 3rd search iteration

Let's run one more iteration with the LLM-guided search. As before, we first extract useful insights from the previous search logs and then prompt the LLM to suggest further refinements to the search space. The response produced by the LLM is shown below:

LLM suggested a refinement of search space and provided the reasoning. (Image by author)

Given the newly defined search space, we run the HalvingRandomSearchCV estimator for the third time:

clf = xgb.XGBClassifier(seed=42, objective='binary:logistic', 
                        eval_metric='logloss', n_jobs=-1, 
                        use_label_encoder=False)
search = HalvingRandomSearchCV(clf, search_space, scoring='f1', 
                               n_candidates=100, cv=5, 
                               min_resources='exhaust', factor=3, 
                               verbose=1).fit(X_train, y_train)

The wall time for running the search is around 11 minutes on my PC.

8.5 Testing

Now that we have gone through three rounds of random search, let's test the performance of the obtained XGBoost model. For that, we can define a helper function to calculate various performance metrics:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import (
    precision_score,
    recall_score,
    accuracy_score,
    roc_auc_score,
    f1_score,
    matthews_corrcoef
)

def metrics_display(y_test, y_pred, y_pred_proba):

    # Obtain confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    # Output classification metrics
    tn, fp, fn, tp = cm.ravel()

    print(f'ROC_AUC score: {roc_auc_score(y_test, y_pred_proba):.3f}')
    print(f'f1 score: {f1_score(y_test, y_pred):.3f}')
    print(f'Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%')
    print(f'Precision: {precision_score(y_test, y_pred)*100:.2f}%')
    print(f'Detection rate: {recall_score(y_test, y_pred)*100:.2f}%')
    print(f'False alarm rate: {fp / (tn+fp)*100:.2f}%')
    print(f'MCC: {matthews_corrcoef(y_test, y_pred):.2f}')

    # Display confusion matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot()

# Calculate performance metrics
y_pred = search.predict(X_test)
y_pred_proba = search.predict_proba(X_test)
metrics_display(y_test, y_pred, y_pred_proba[:, 1])

In the code snippet above, we applied the trained XGBoost model to the testing dataset and assessed its performance. The obtained confusion matrix is shown below:

The confusion matrix of applying the trained XGBoost model to the testing dataset. (Image by author)

Based on the confusion matrix, we can calculate various performance metrics:

  • Accuracy: 94.22%
  • Precision: 89.5%
  • Detection rate (recall): 80.52%
  • False alarm rate (1-Specificity): 2.35%
  • ROC-AUC score: 0.983
  • F1 score: 0.848
  • Matthews correlation coefficient: 0.81
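
As a quick consistency check, the reported F1 score follows directly from the reported precision and detection rate (recall):

# F1 is the harmonic mean of precision and recall
precision, recall = 0.895, 0.8052
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # ~0.848, matching the reported F1 score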

9. Comparison Against Out-of-Box AutoML Tool

As mentioned at the beginning, we would like to compare the tuning results of the developed LLM-guided search against an off-the-shelf, algorithmic AutoML tool to assess whether tapping into ML domain expertise can truly bring value.

The AutoML tool we will be employing is FLAML, developed by Microsoft Research, whose name stands for A Fast Library for Automated Machine Learning & Tuning. This state-of-the-art tool supports fast and economical automatic tuning and is capable of handling large search spaces with heterogeneous evaluation costs and complex constraints/guidance/early stopping. For installing the library, please refer to the official page.

Using this tool is extremely simple: in the code snippet below, we first instantiate an AutoML object and then call its fit() method to kick off the hyperparameter tuning process. We limit the tuning to the XGBoost model only (FLAML also supports other model types) and set a time budget of 3600 s, which is roughly the same as the total time we spent on the three rounds of LLM-guided search (random search time + LLM response time).

from flaml import AutoML

automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=3600, 
          estimator_list=['xgboost'], log_file_name='automl.log', 
          log_type='best')
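
To score both approaches identically, we can reuse the metrics_display() helper from the previous section on the FLAML-tuned model; the snippet below is a sketch of that evaluation step (not shown in the original post):

# Evaluate the FLAML-tuned model on the same test set
y_pred = automl.predict(X_test)
y_pred_proba = automl.predict_proba(X_test)
metrics_display(y_test, y_pred, y_pred_proba[:, 1])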

The results comparison between the LLM-guided search and out-of-box FLAML is shown in the following:

Results comparison between two tuning approaches. (Image by author)

We can see that the LLM-guided search has yielded better results than the out-of-box FLAML for all of the considered metrics. Since we are looking at a cybersecurity application, the two most important metrics are the detection rate and the false alarm rate. Here, we can see that the LLM-guided search has managed to significantly improve the detection rate while slightly lowering the false alarm rate. As a result, the XGBoost model trained via the LLM-guided search would be a superior anomaly detector compared to the FLAML-searched one.

Overall, we can conclude that for our current case study, leveraging the ML domain expertise embedded in the LLM can indeed bring value in hyperparameter tuning, even when paired with a straightforward search algorithm.


10. Conclusion

In this blog, we investigated a new paradigm of AutoML: LLM-guided hyperparameter tuning. The key idea is to treat the large language model as an ML expert and leverage its ML domain knowledge to propose a suitable optimization metric, suggest an initial search space, and recommend refinements to the search space.

Later, we applied this approach to identify the optimal XGBoost model for a cybersecurity dataset, and our results indicated that the informed hyperparameter search (i.e., the LLM-guided search) yielded a better anomaly detection model than the pure algorithmic-based AutoML tool FLAML, achieving a higher detection rate and a lower false alarm rate.


