Introduction to PyTorch: from training loop to prediction


In this post we will cover how to implement a logistic regression model using PyTorch in Python.

PyTorch is one of the most popular and widely used deep learning frameworks among data scientists and machine learning engineers worldwide, so learning this tool is an essential step in your learning path if you want to build a career in applied AI.

It sits alongside TensorFlow, another very popular deep learning framework developed by Google.

There are no notable fundamental differences between the two, apart from the structure and organization of their APIs, which can be quite different.

While both frameworks allow us to create very complex neural networks, PyTorch is often preferred for its more Pythonic style and for the freedom it gives developers to integrate custom logic into their software.

We will use the Sklearn breast cancer dataset, an open source dataset I have already used in some of my previous articles, to train a binary classification model.

The goal is to explain how to:

  • go from a pandas dataframe to PyTorch's Datasets and DataLoaders
  • create a neural network for binary classification in PyTorch
  • train the model with a training loop
  • evaluate the performance of our model with utility functions and matplotlib
  • use this network to make predictions

By the end of this article we will have a clear idea of how to create a neural network in PyTorch and how the training loop works.

Let's get started!

Install PyTorch and other dependencies

We start our project by creating a virtual environment in a dedicated folder.

Visit this link to learn how to create a virtual environment with Conda.

How to Set Up a Development Environment for Machine Learning

Once our virtual environment has been created, we can run the command

$ pip install torch -U

in the terminal. This command will install the latest version of PyTorch, which as of this writing is version 2.0.

In a notebook, we can check the installed version using torch.__version__ after importing torch.
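For example (the exact version string depends on your build):

import torch

print(torch.__version__)
>>> 2.0.0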

We can verify that PyTorch is correctly installed in the environment by importing and launching a small test script, as shown in the official guide.

import torch

x = torch.rand(5, 3)
print(x)

>>> tensor([[0.3890, 0.6087, 0.2300],
        [0.1866, 0.4871, 0.9468],
        [0.2254, 0.7217, 0.4173],
        [0.1243, 0.1482, 0.6797],
        [0.2430, 0.4608, 0.8886]])

If the script executes correctly then we are ready to proceed with the project. Otherwise, I suggest the reader refer to the official guide at https://pytorch.org/get-started/locally/.

Let's continue with the installation of the additional dependencies:

  • Sklearn: pip install scikit-learn
  • Pandas: pip install pandas
  • Matplotlib: pip install matplotlib

Libraries like NumPy are automatically installed when you install PyTorch.

Import and explore the dataset

Let's start by importing the installed libraries and the breast cancer dataset from Sklearn with the following code snippet

import torch
import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer

import matplotlib.pyplot as plt

breast_cancer_dataset = load_breast_cancer(as_frame=True, return_X_y=True)

Let's create a dataframe dedicated to holding our X and y like this

df = breast_cancer_dataset[0]
df['target'] = breast_cancer_dataset[1]
df
Example of the dataframe. Image by author.
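Before modeling, it is worth checking the size and class balance of the data:

print(df.shape)
>>> (569, 31)

print(df.target.value_counts())
>>> 1    357
    0    212

There are 569 rows (30 features plus the target), with 357 benign samples (target 1) and 212 malignant ones (target 0).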

Our goal is to create a model that can predict the target column based on the characteristics in the other columns.

Let's do a bit of exploratory analysis to get familiar with the dataset. We will use the sweetviz library to automatically create an analysis report.

We can install sweetviz with the command pip install sweetviz and create an EDA (exploratory data analysis) report with this piece of code

import sweetviz as sv

eda_report = sv.analyze(df)
eda_report.show_notebook()
Sweetviz analyzing our dataset. Image by author.

Sweetviz will create a report right in our notebook for us to explore.

"Association" tab in Sweetviz. Image by author.

We see how several columns are highly associated with a value of 0 or 1 of our target column.

Being a multidimensional dataset and having variables with different distributions, a neural network is a valid option to model this data. That said, this dataset can also be modeled by simpler models, such as decision trees.

We will now import two other libraries in order to visualize the dataset. We will use PCA (Principal Component Analysis) from Sklearn and Seaborn to visualize the multidimensional dataset.

PCA will help us compress the large number of variables into just two, which we will use as the X and Y axes in a Seaborn scatterplot. Seaborn takes an additional parameter called hue to color the dots based on an additional variable. We will use our target.

import seaborn as sns
from sklearn import decomposition

pca = decomposition.PCA(n_components=2)

X = df.drop("target", axis=1).values
y = df['target'].values

vecs = pca.fit_transform(X)
x0 = vecs[:, 0]
x1 = vecs[:, 1]

sns.set_style("whitegrid")
sns.scatterplot(x=x0, y=x1, hue=y)
plt.title("Proiezione PCA")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.xticks([])
plt.yticks([])
plt.show()
PCA projection of the breast cancer dataset. Image by author.

We can see how the data points of the two classes cluster based on common characteristics. It will be the goal of our neural network to classify the rows as target 0 or 1.
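As a side note, we can check how much of the total variance the two components retain, using the pca object fitted above:

print(pca.explained_variance_ratio_)

Since the features are unscaled here, the columns with the largest values dominate, and the first component alone captures most of the variance.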

Create the datasets and dataloaders classes

PyTorch provides Dataset and DataLoader objects to allow us to efficiently organize and load our data into the neural network.

It would be possible to use pandas directly, but this would make our code less efficient.

The Dataset class allows us to specify the right format for our data and to apply the retrieval and transformation logic that is often fundamental (think of the data augmentation applied to images).

Let's see how to create a PyTorch Dataset object.

from torch.utils.data import Dataset

class BreastCancerDataset(Dataset):
    def __init__(self, X, y):
        # create feature tensors
        self.features = torch.tensor(X, dtype=torch.float32)
        # create label tensors (float32, since F.binary_cross_entropy expects float targets)
        self.labels = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        # we define a method to retrieve the length of the dataset
        return self.features.shape[0]

    def __getitem__(self, idx):
        # necessary override of the __getitem__ method which helps to index our data
        x = self.features[idx]
        y = self.labels[idx]
        return x, y

This is a class that inherits from Dataset and allows the DataLoader, which we will create shortly, to efficiently retrieve batches of data.

The class takes X and y as input.
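A quick sanity check of the class, using the X and y arrays defined earlier (we will pass the properly scaled splits later):

dataset = BreastCancerDataset(X, y)

print(len(dataset))
>>> 569

features, label = dataset[0]
print(features.shape, label)
>>> torch.Size([30]) tensor(0.)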

Training, validation and test datasets

Before proceeding to the following steps, it is important to create training, validation and test sets.

These will help us evaluate the performance of our model and understand the quality of the predictions.

For the interested reader, I suggest reading the article 6 Things You Should Do Before Training Your Model and what is cross-validation in machine learning to better understand why splitting our data into three partitions is an effective method for performance evaluation.

With Sklearn this becomes easy with the train_test_split method.

from sklearn import model_selection

train_ratio = 0.50
validation_ratio = 0.25
test_ratio = 0.25

x_train, x_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=1 - train_ratio)
x_val, x_test, y_val, y_test = model_selection.train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio)) 

print(x_train.shape, x_val.shape, x_test.shape)

>>> (284, 30) (142, 30) (143, 30)

With this small snippet of code we created our training, validation and test sets according to controllable splits.

Data normalization

When doing deep learning, even for a simple task like binary classification, it is almost always necessary to normalize our data.

Normalizing means bringing all the values of the various columns in the dataset to the same numerical scale. This helps the neural network converge more effectively and thus reach accurate predictions faster.

We will use Sklearn's StandardScaler.

from sklearn import preprocessing

scaler = preprocessing.StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)
x_val_scaled = scaler.transform(x_val)
x_test_scaled = scaler.transform(x_test)

Notice how fit_transform is applied only to the training set, while transform is applied to the other two datasets. This avoids data leakage – when information from our validation or test set unintentionally leaks into our training procedure. We want our training set to be the only source of learning, unaffected by test data.
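Under the hood, StandardScaler simply learns each column's mean and standard deviation on the training set and reuses those statistics everywhere. A minimal sketch of the equivalent manual computation:

# statistics are computed on the training set only
train_mean = x_train.mean(axis=0)
train_std = x_train.std(axis=0)

# ...and reused to scale every split
x_train_manual = (x_train - train_mean) / train_std
x_val_manual = (x_val - train_mean) / train_std
x_test_manual = (x_test - train_mean) / train_std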

This data is now ready to be input to the BreastCancerDataset class.

train_dataset = BreastCancerDataset(x_train_scaled, y_train)
val_dataset = BreastCancerDataset(x_val_scaled, y_val)
test_dataset = BreastCancerDataset(x_test_scaled, y_test)

We import the dataloader and initialize the objects.

from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=16,
    shuffle=True,
    drop_last=True
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=16,
    shuffle=False,
    drop_last=True
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=16,
    shuffle=False,
    drop_last=True
)

The power of the DataLoader is that it allows us to specify whether to shuffle our data and the size of the batches in which the data is supplied to the model. The batch size is to be considered a hyperparameter of the model and can therefore impact the results.
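We can verify that everything is wired correctly by pulling a single batch from the training loader:

features, labels = next(iter(train_loader))

print(features.shape, labels.shape)
>>> torch.Size([16, 30]) torch.Size([16])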

Neural network implementation in PyTorch

Creating a model in PyTorch might sound complex, but it really only requires understanding a few basic concepts.

  1. When writing a model in PyTorch, we use an object-oriented approach, just as we did with the dataset. It means that we create a class like class MyModel which inherits from PyTorch's nn.Module class.
  2. PyTorch is an automatic differentiation framework. This means that when we write a neural network trained with the backpropagation algorithm, the derivatives of the loss are computed automatically behind the scenes. This requires writing some dedicated code that might be confusing the first time around.

I advise the reader who wants to know the basics of how neural networks work to consult the article Introduction to neural networks – weights, biases and activation

Introduction to neural networks – weights, biases and activation

That said, let's see what the code for writing a logistic regression model looks like.

import torch.nn as nn

class LogisticRegression(nn.Module):
    """
    Our neural network accepts num_features and num_classes.

    num_features - number of features to learn from
    num_classes - number of outputs to return (here 1, since the network outputs a single probability for the binary target)
    """

    def __init__(self, num_features, num_classes):
        super().__init__() # initialize the init method of nn.Module

        self.num_features = num_features
        self.num_classes = num_classes

        # create a single layer of neurons on which to apply the log reg
        self.linear1 = nn.Linear(in_features=num_features, out_features=num_classes) 

    def forward(self, x):
        logits = self.linear1(x) # pass our data through the layer
        probs = torch.sigmoid(logits) # we apply a sigmoid function to obtain the probabilities of belonging to a class (0 or 1)
        return probs # return probabilities

Our class inherits from nn.Module. This class provides the methods behind the scenes that make the model work.

__init__() method

The __init__ method of a class contains the logic that runs when instantiating a class in Python. Here we pass two arguments: the number of features and the number of classes to predict.

num_features corresponds to the number of columns that make up our dataset minus our target variable, while num_classes corresponds to the number of outputs the neural network must return.

In addition to the two arguments and their class variables, we see the super().__init__() line. The super function calls the __init__ method of the parent class. This gives us the functionality of nn.Module within our model.

Still inside the __init__ block, we implement a linear layer called self.linear1, which takes as arguments the number of features and the number of outputs to return.

forward() method

By writing the forward method we tell Python to override the same method within PyTorch's nn.Module parent class. In fact, this method is called when performing a forward pass – that is, when our data passes from one layer to another.

forward accepts the input x, which contains the features the model will learn from.

The input passes through the first layer, creating the logits variable. The logits are the raw outputs of the network, not yet converted into probabilities by the final activation function, which in this case is a sigmoid. They are the internal representation of the network before being mapped through a function that makes them interpretable.

In this case the sigmoid function will map the logits to probabilities between 0 and 1. If the output is less than 0.5, the predicted class will be 0; otherwise it will be 1. This happens in the line probs = torch.sigmoid(logits).
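A small standalone example of this mapping:

logits = torch.tensor([-2.0, 0.0, 3.0])
probs = torch.sigmoid(logits)

print(probs)
>>> tensor([0.1192, 0.5000, 0.9526])

print(torch.where(probs > 0.5, 1, 0))
>>> tensor([0, 0, 1])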

Utility functions for plotting and accuracy calculation

Let's create utility functions to use in the training loop that we will see shortly. These two are used to compute the accuracy at the end of each epoch and to display the performance curves at the end of the training.

def compute_accuracy(model, dataloader):
    """
    This function puts the model in evaluation mode (model.eval()) and calculates the accuracy with respect to the input dataloader
    """
    model = model.eval()
    correct = 0
    total_examples = 0
    for idx, (features, labels) in enumerate(dataloader):
        with torch.no_grad():
            # the model returns probabilities (sigmoid outputs)
            probs = model(features)
        predictions = torch.where(probs > 0.5, 1, 0)
        lab = labels.view(predictions.shape)
        comparison = lab == predictions

        correct += torch.sum(comparison)
        total_examples += len(comparison)
    return correct / total_examples

def plot_results(train_loss, val_loss, train_acc, val_acc):
    """
    This function takes lists of values and creates side-by-side graphs to show training and validation performance
    """
    fig, ax = plt.subplots(1, 2, figsize=(15, 5))
    ax[0].plot(
        train_loss, label="train", color="red", linestyle="--", linewidth=2, alpha=0.5
    )
    ax[0].plot(
        val_loss, label="val", color="blue", linestyle="--", linewidth=2, alpha=0.5
    )
    ax[0].set_xlabel("Epoch")
    ax[0].set_ylabel("Loss")
    ax[0].legend()
    ax[1].plot(
        train_acc, label="train", color="red", linestyle="--", linewidth=2, alpha=0.5
    )
    ax[1].plot(
        val_acc, label="val", color="blue", linestyle="--", linewidth=2, alpha=0.5
    )
    ax[1].set_xlabel("Epoch")
    ax[1].set_ylabel("Accuracy")
    ax[1].legend()
    plt.show()

Model training

Now we come to the part where most deep learning newcomers struggle: the PyTorch training loop.

Let's look at the code first and then comment on it

import torch.nn.functional as F

model = LogisticRegression(num_features=x_train_scaled.shape[1], num_classes=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

num_epochs = 10

train_losses, val_losses = [], []
train_accs, val_accs = [], []

for epoch in range(num_epochs):

    model = model.train()
    t_loss_list, v_loss_list = [], []
    for batch_idx, (features, labels) in enumerate(train_loader):

        train_probs = model(features)
        train_loss = F.binary_cross_entropy(train_probs, labels.view(train_probs.shape))

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        if batch_idx % 10 == 0:
            print(
                f"Epoch {epoch+1:02d}/{num_epochs:02d}"
                f" | Batch {batch_idx:02d}/{len(train_loader):02d}"
                f" | Train Loss {train_loss:.3f}"
            )

        t_loss_list.append(train_loss.item())

    model = model.eval()
    for batch_idx, (features, labels) in enumerate(val_loader):
        with torch.no_grad():
            val_probs = model(features)
            val_loss = F.binary_cross_entropy(val_probs, labels.view(val_probs.shape))
            v_loss_list.append(val_loss.item())

    train_losses.append(np.mean(t_loss_list))
    val_losses.append(np.mean(v_loss_list))

    train_acc = compute_accuracy(model, train_loader)
    val_acc = compute_accuracy(model, val_loader)

    train_accs.append(train_acc)
    val_accs.append(val_acc)

    print(
        f"Train accuracy: {train_acc:.2f}"
        f" | Val accuracy: {val_acc:.2f}"
    )

Unlike TensorFlow, PyTorch requires us to write a training loop in pure Python.

Let's see the procedure step by step:

  1. We instantiate the model and the optimizer
  2. We decide on a number of epochs
  3. We create a for loop that iterates through the epochs
  4. For each epoch, we set the model to training mode with model.train() and cycle through the train_loader
  5. For each batch of the train_loader, we compute the predictions and the loss, reset the gradients with optimizer.zero_grad(), backpropagate with train_loss.backward(), and update the weights of the network with optimizer.step() – see the snippet below
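Distilled to its core, every PyTorch training step follows the same five-line pattern (names as in the loop above):

probs = model(features)                                         # 1. forward pass
loss = F.binary_cross_entropy(probs, labels.view(probs.shape))  # 2. compute the loss
optimizer.zero_grad()                                           # 3. reset the gradients accumulated in the previous step
loss.backward()                                                 # 4. backpropagation: compute the gradients of the loss
optimizer.step()                                                # 5. update the weights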

At this point the training loop is complete; as written in the code, the same logic is also applied to the validation dataloader at the end of each epoch to monitor generalization.

Here is the output of the training after launching this code

Training in progress. Image by author.

Neural network performance evaluation

We use the previously created utility function to plot the training and validation loss and accuracy curves.

plot_results(train_losses, val_losses, train_accs, val_accs)
Performances of the neural network. Image by author.

Our binary classification model quickly converges to high accuracy, and we see how the loss drops at the end of each epoch.

The dataset is simple to model, and the small number of examples means we do not get to see a more gradual convergence of the network towards high performance.

Note that it is possible to integrate the TensorBoard software with PyTorch to automatically log performance metrics across experiments.
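A minimal sketch of what that integration could look like (the run name below is just an example):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/breast_cancer_logreg")  # hypothetical run name

# inside the training loop, once per epoch:
# writer.add_scalar("Loss/train", np.mean(t_loss_list), epoch)
# writer.add_scalar("Loss/val", np.mean(v_loss_list), epoch)

writer.close()

Running tensorboard --logdir runs in a terminal then serves an interactive dashboard of the logged curves.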

Create predictions

We have reached the end of this guide. Let's see the code to create predictions for our entire dataset.

# we transform all our features with the scaler
X_scaled_all = scaler.transform(X)

# transform in tensors
X_scaled_all_tensors = torch.tensor(X_scaled_all, dtype=torch.float32)

# we set the model in inference mode and create the predictions
with torch.inference_mode():
    probs = model(X_scaled_all_tensors)
    predictions = torch.where(probs > 0.5, 1, 0)

df['predictions'] = predictions.numpy().flatten()

Now let's import the metrics package from Sklearn which allows us to quickly calculate the confusion matrix and classification report directly on our pandas dataframe.

from sklearn import metrics

print(metrics.classification_report(y_pred=df.predictions, y_true=df.target))
Summary of performance on the entire dataset with a classification report. Image by author.

And the confusion matrix, which shows the number of correct predictions on the diagonal

metrics.confusion_matrix(y_pred=df.predictions, y_true=df.target)

>>> array([[197,  15],
       [ 13, 344]])
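Out of 569 rows, 197 + 344 = 541 are classified correctly – roughly 95% accuracy. Optionally, Sklearn can also render the same matrix graphically:

disp = metrics.ConfusionMatrixDisplay.from_predictions(
    y_true=df.target, y_pred=df.predictions
)
plt.show()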

Here is a small function to create a classification line that separates the classes in the PCA graph

def plot_boundary(model):

    # weights of the first two input features and the bias of the single output neuron
    w1 = model.linear1.weight[0][0].detach()
    w2 = model.linear1.weight[0][1].detach()
    b = model.linear1.bias[0].detach()

    # the boundary is the line where w1 * x1 + w2 * x2 + b = 0
    x1_min = -1000
    x2_min = (-(w1 * x1_min) - b) / w2

    x1_max = 1000
    x2_max = (-(w1 * x1_max) - b) / w2

    return x1_min, x1_max, x2_min, x2_max

x1_min, x1_max, x2_min, x2_max = plot_boundary(model)

sns.scatterplot(x=x0, y=x1, hue=y)
plt.title("PCA Projection")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.xticks([])
plt.yticks([])
plt.plot([x1_min, x1_max], [x2_min, x2_max], color="k", label="Classification", linestyle="--")
plt.legend()
plt.show()

And here's how the model separates benign from malignant cells

Classification boundary visualized. Image by author.

Conclusions

In this article we have seen how to create a binary classification model with PyTorch, starting from a Pandas dataframe.

We've seen what the training loop looks like, how to evaluate the model, and how to create predictions and visualizations to aid interpretation.

With PyTorch it is possible to create very complex neural networks … just think that Tesla, the electric car manufacturer, uses PyTorch to create its models.

For those who want to start their deep learning journey, learning PyTorch early on is a high-priority task, as it allows you to build important technologies that solve complex data-driven problems.


