Democratizing Machine Learning with AWS SageMaker AutoML


Introduction

AI is still one of the hottest topics, especially with the rise of ChatGPT. Many companies are now trying to make use of AI to extract useful insights from data that can be used to optimize their processes or build better products.

However, building effective AI models requires expertise in many different areas, like data preprocessing, model selection and hyperparameter tuning. All of these steps can be time-consuming and require specialized knowledge.

This is where AutoML comes into play. AutoML automates many of the above-mentioned steps required for building an AI model.

AutoML is rapidly becoming a popular solution for businesses and data scientists. It empowers organizations to leverage ML and AI to make informed decisions, without requiring them to be experts in data science. With the increasing demand for ML in businesses, AutoML provides an easy and efficient way to create accurate models, regardless of one's expertise.

In this article, we'll examine one very popular AutoML tool available in the market today, AWS SageMaker AutoML, and demonstrate how it can be used to solve complex ML use cases.

I will train a model with the old-fashioned manual approach and compare the results to the ones that AWS SageMaker AutoML produces.

I will use the credit card fraud detection dataset from Kaggle for this comparison [1]. You can find the dataset here.

By the end of this article, you'll have a clear understanding of how AutoML can help leverage ML to drive meaningful insights and make informed decisions.

AWS SageMaker AutoML

Figure 1: Overview of AWS SageMaker AutoML (Image by author, based on [2]).

Figure 1 gives an overview of the different steps that AWS SageMaker AutoML covers.

It includes the following steps:

  1. Data Preparation: You can easily upload your data to Amazon S3. Once your data is uploaded, SageMaker AutoML automatically analyzes your data in order to detect any missing values, outliers or data types that need to be transformed.
  2. Automatic Model Creation: AWS SageMaker AutoML automatically trains multiple Machine Learning models with different hyperparameters and algorithms to determine the best model for your data. It also provides automatic model tuning, which adjusts the hyperparameters of the selected models to further optimize their performance. In addition, it creates the notebooks for running the model selection automatically for you, so that you have full visibility into what is executed during this process.
  3. Model Deployment: Once the best model has been selected, AWS SageMaker lets you deploy it to a SageMaker endpoint or use it in a batch transform job to make predictions on new data. On top of that, AWS SageMaker Model Monitor can be used to alert you if any issues arise (like data drift, concept drift, …). It also provides tools for retraining the model with new data, as well as updating the model's hyperparameters or algorithms to improve its performance.

AWS SageMaker AutoML offers a Python SDK that can be used for starting your AutoML job, as well as a GitHub repository with various notebook examples on how to use the AutoML SDK for concrete ML use cases.
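For illustration, here is a minimal sketch of how such a job could be launched and deployed with the SageMaker Python SDK. The S3 path, IAM role and job parameters are placeholders, not values from this project; check the SDK documentation for the exact options:

import sagemaker
from sagemaker.automl.automl import AutoML

# assumption: the credit card CSV (including the "Class" column) has already
# been uploaded to S3, and an execution role with SageMaker permissions exists
session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder

automl = AutoML(
    role=role,
    target_attribute_name="Class",        # column to predict
    sagemaker_session=session,
    max_candidates=10,                    # limit the number of candidate models
    job_objective={"MetricName": "F1"},   # optimize for F1 on the imbalanced data
)

# start the AutoML job on the training data stored in S3
automl.fit(inputs="s3://my-bucket/creditcard/train/", job_name="creditcard-automl")

# deploy the best candidate to a real-time endpoint
predictor = automl.deploy(initial_instance_count=1, instance_type="ml.m5.large")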

There are also other powerful and well-known AutoML tools available in the market, such as Google Cloud AutoML and H2O.ai, which also have their own unique strengths and weaknesses.

Google Cloud AutoML is known for its ease of use and intuitive interface, which makes it a good fit if you are new to ML and not that deep into coding. Google Cloud AutoML supports image data, video data, text data and tabular data. You can read more about that here.

H2O.ai is known for its speed and scalability, making it a good option for large datasets and complex models. H2O.ai offers interfaces in R, Python, or a web GUI. You can read more about its features here.
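Just to give a feeling for the H2O.ai side, a rough sketch of an H2O AutoML run in Python could look like the following (the number of models and the seed are arbitrary choices, not a recommendation):

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# load the credit card data into an H2O frame and mark the target as categorical
hf = h2o.import_file("data/creditcard.csv")
hf["Class"] = hf["Class"].asfactor()

# 80/20 split, analogous to the manual approach below
train, test = hf.split_frame(ratios=[0.8], seed=42)

# train up to 10 models and inspect the leaderboard of candidates
aml = H2OAutoML(max_models=10, seed=42)
aml.train(y="Class", training_frame=train)

print(aml.leaderboard.head())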

Manual Training Approach

Before using AWS SageMaker AutoML to come up with a classifier for the credit card dataset, I first train a model the classic way: doing everything myself from scratch.

This gives me a baseline against which to compare the AutoML approach from AWS, with the expectation that AWS SageMaker AutoML outperforms my manual, semi-optimal approach.

For the manual approach, I make use of Scikit-learn and run through the steps highlighted in the next sections.

You can also find the complete notebook in my GitHub repository here.

Data preparation

I load the dataset from a CSV file and first check the class distribution. This shows that the dataset is highly imbalanced, with only 0.17% of all samples being positive.

The dataset itself doesn't contain any missing values.

I then split the dataset with an 80/20 split into train and test sets and standardize the features, where the StandardScaler is fit only on the training set to avoid data leakage and overly optimistic results.

You can find the code for these steps below.

import sys
import os
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# step 1: Load the dataset from the csv file. 
# You can download the dataset from Kaggle
filepath = os.path.join("data", "creditcard.csv")
df = pd.read_csv(filepath)

# step 2: check data imbalance on target
num_samples = len(df)
count_neg_class = np.sum(df["Class"] == 0)
count_pos_class = np.sum(df["Class"] == 1)

print(f"There are {count_neg_class} negative samples ({np.round(100 * count_neg_class / num_samples, 2)} % of total data).")
print(f"There are {count_pos_class} positive samples ({np.round(100 * count_pos_class / num_samples, 2)} % of total data).")

# step 3: split data into train and test set
X = df.drop(columns="Class").to_numpy()
y = df["Class"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# step 4: scale the data
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Typically, extensive Exploratory Data Analysis (EDA) would also be part of the data preparation step. But for the sake of this experiment, I did not do extensive EDA, as the dataset is already well-prepared for ML.

But keep in mind that this part is also crucial to the success of your ML training and typically takes a fair amount of time.
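For reference, a few lightweight checks that such a quick sanity pass could consist of (reusing the dataframe and plotting libraries already imported above):

# quick sanity checks on the raw dataframe
print(df.shape)                     # number of rows and columns
print(df.isna().sum().sum())        # total number of missing values
print(df["Class"].value_counts())   # class distribution of the target

# summary statistics of the features
print(df.describe().T)

# visualize the (highly skewed) class distribution
sns.countplot(x="Class", data=df)
plt.title("Class distribution")
plt.show()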

Model Selection

The next step is to figure out which ML algorithm is best suited for the data. For this purpose, I first train a very simple baseline model using logistic regression. This gives me something simple that I can then compare more complex algorithms against.

The goal should always be: keep it simple! Don't start with a neural network, which is harder to explain in the end, if a simpler algorithm like logistic regression could also do the job.

The logistic regression model achieved an F1-score of 70.6%. I am using the F1-score for this dataset, as it is highly imbalanced and accuracy would not deliver a meaningful measure of the model: simply predicting every sample as negative would already lead to an accuracy of more than 99%!
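To make this concrete, a trivial classifier that always predicts the negative class already reaches an accuracy above 99% on this data while its F1-score is 0. A small sketch using scikit-learn's DummyClassifier illustrates this:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# a "model" that always predicts the majority (negative) class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

dummy_preds = dummy.predict(X_test)
print(f"Dummy accuracy: {accuracy_score(y_test, dummy_preds)}")  # > 0.99
print(f"Dummy F1-score: {f1_score(y_test, dummy_preds)}")        # 0.0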

You can find the code for training the baseline model below.

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix

from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression()
log_model.fit(X_train, y_train)

preds = log_model.predict(X_test)
print(f"Test Acc: {accuracy_score(y_test, preds)}")
print(f"Test F1-Score: {f1_score(y_test, preds)}")
print(f"Test Precision: {precision_score(y_test, preds)}")
print(f"Test Recall: {recall_score(y_test, preds)}")

Okay, we now have a baseline. Let's try out different classification algorithms with their default hyperparameters and see which algorithm performs best on the data.

I used a 5-fold cross-validation to train each of the following models:

  • decision tree
  • support vector machine
  • k-nearest neighbors
  • random forest
  • AdaBoost

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_predict

dict_models = {
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "Nearest Neighbor": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Ada Boost": AdaBoostClassifier()
}

# train all models by using the models dictionary
results_dict = {}
for model_name, model in dict_models.items():
    print(f"Start training {model_name}...")
    preds = cross_val_predict(model, X_train, y_train, cv=5)

    f1 = f1_score(y_train, preds)
    precision = precision_score(y_train, preds)
    recall = recall_score(y_train, preds)

    print(f"F1-Score: {f1}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print("nn")
    results_dict[model_name] = (f1, precision, recall)

# create a pandas dataframe with the results on sort on f1-score
df_results = (pd.DataFrame.from_dict(results_dict, orient="index", columns=["F1-Score", "Precision", "Recall"])
             .sort_values(by="F1-Score", ascending=False))
df_results

The random forest algorithm delivered the best results with an F1-Score of 86.9%, followed by the nearest neighbor algorithm with an F1-Score of 84.8%. Not bad!

The next step is to fine-tune the winner (random forest).

For this, I selected some hyperparameter values to try out and used a randomized cross-validation search to find the set of hyperparameters that leads to the best model.

The code for this evaluation:

from sklearn.model_selection import RandomizedSearchCV

params = {
    "n_estimators": [10, 20, 30, 60, 80, 100],
    "criterion" : ["gini", "entropy"],
    "max_depth" : [4, 5, 10, None],
    "min_samples_split": [2, 4, 6],
    "class_weight": [None, "balanced", "balanced_subsample"]
}

clf_rf = RandomizedSearchCV(RandomForestClassifier(), params, n_iter=50, scoring="f1", cv=5, verbose=1, n_jobs=-1)
clf_rf.fit(X_train, y_train)

# let's print the best score and save the best model
print(f"Best f1-score: {clf_rf.best_score_}")
print(f"Best parameters: {clf_rf.best_params_}")
best_random_forest_model = clf_rf.best_estimator_

The best model scored more or less the same as the one that I got without tuning the hyperparameters. What a waste of time!
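For completeness, the tuned model can still be evaluated on the held-out test set, reusing the variables defined above; a short sketch:

# evaluate the tuned random forest on the held-out test set
test_preds = best_random_forest_model.predict(X_test)

print(f"Test F1-Score: {f1_score(y_test, test_preds)}")
print(f"Test Precision: {precision_score(y_test, test_preds)}")
print(f"Test Recall: {recall_score(y_test, test_preds)}")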
