Farm to Table: The Workflow of a Classification Model

Introduction
A typical machine learning workflow rarely involves applying one single approach to the problem at hand. Models generally go through an iterative process with various techniques applied and evaluated. Feature engineering strategies are tested, discarded, then revisited; algorithms and their parameters are iterated exhaustively, sometimes for just a fraction of a percentage improvement. This cyclical process of experimentation and refinement is essential in working towards a robust solution.

The following article is a demonstration of a typical workflow in preparing, testing, comparing, and scoring a classification model for a given problem. In this example, the product team of a hypothetical cooking website is attempting to improve their current system of selecting recipes for the website's front page, by implementing a machine learning system based on the past performance of recipes they've manually selected. To that end, two algorithms are applied – a Logistic Regression and a Random Forest classifier – then evaluated and compared against the current manual approach, which serves as the baseline key performance indicator. The details of this project in full can be seen below.
Recipe Recommender System
Problem Definition
The product team has requested a classification model with the ability to correctly recommend recipes that produce high traffic on the website, to replace their current selection process for the site. For this purpose a synthetic dataset is employed (available in this project's GitHub folder), comprising recipes with data on the recipe category, nutritional metrics such as calories, carbohydrate, sugar and protein levels, the serving size of the recipe, and the target variable high_traffic (indicating that the recipe generated high traffic on the website). Based on this, their request is to produce a classification model which can recommend popular recipes for display on the website at least 80% of the time.

To begin, the problem can be further defined by noting that the product team is only interested in the model's ability to predict high traffic recipes well, with little to no interest in its performance on low traffic recipes. This can therefore be understood as a request for a classification model with high precision: of the recipes the model recommends, the proportion that genuinely generate high traffic should be as large as possible, which means minimising false positives.
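As a quick illustration of what this means in practice (a toy example with made-up labels, not part of the project code), precision is the share of true positives among all positive predictions, TP / (TP + FP), which scikit-learn computes via precision_score:
#illustrative toy example with hypothetical labels (1 = high traffic, 0 = low traffic)
from sklearn.metrics import precision_score
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 1]
#two true positives and one false positive: precision = 2 / (2 + 1) ≈ 0.67
print(precision_score(y_true, y_pred))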
Step 1: Import Libraries and Data
Firstly, I import the required libraries for data analysis, manipulation, and visualization: numpy (np), pandas (pd), matplotlib.pyplot (plt), seaborn (sns), and missingno (msno). Missingno is an effective library for determining the distribution of missing values across the dataset, making it useful for the data cleaning process.
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
I also import functions for machine learning preprocessing, model instantiation and training, and model evaluation. As the product team's requirement is framed around correctly recommending high traffic recipes, precision will be the most important metric throughout the evaluation.
#preprocessing
from sklearn.preprocessing import LabelEncoder
#model preparation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
#models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
#metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
Following this, the data is imported from a csv file using the read_csv() function in pandas. The head of the data is printed to confirm the data has been loaded into the pandas DataFrame, df. Immediately, it's clear there are missing values in the nutritional columns (calories, carbohydrate, sugar, protein), which will need to be filled with appropriate values.
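The loading step itself is a single call; a minimal sketch is shown below, where the file name is a placeholder and may differ from the actual file in the GitHub folder.
#load the recipe data (file name is a placeholder)
df = pd.read_csv('recipes.csv')
#confirm the data has loaded correctly
print(df.head())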

Step 2: Data Validation
With the data loaded into the workspace, the next step is to describe the dataset and understand its general shape and information.
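A minimal sketch of this validation step, using standard pandas methods, might look like the following.
#overview of column data types and non-null counts
df.info()
#summary statistics for the numerical columns
print(df.describe())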

Immediately there are some notable observations from the dataframe information; for example, the calories column has the widest range of values, with a minimum of 0.14 and a maximum of 3705.82. There are 52 missing values in the nutritional columns, and it will be important to investigate how this missing data is spread. Additionally, the high_traffic column appears to have 373 missing values – this will also need to be investigated. Lastly, the servings column is stored as an object rather than a float or an int, which suggests it may contain string data in some rows; this will need to be standardised before model training.
To further investigate the values in the dataset, I define a custom function named print_uniques() to print the unique values from each column of the dataframe. The function uses a generator expression to iterate through the columns and, for each one, prints the column name, its unique values, and the number of unique values.
#function to print all unique values and their counts from each column
def print_uniques(df):
    #for large datasets - using a generator for speed
    uniques_generator = ((x, df[x].unique(), df[x].nunique()) for x in df.columns)
    print('\nUnique Values:')
    for x, unique_values, num_unique in uniques_generator:
        print(f"{x}:\n {unique_values}\n ({num_unique} unique values)")
#printing uniques of selected columns
print_uniques(df[['category', 'servings','high_traffic']])

It's clear from the output above that the servings column has entries such as ‘4 as a snack' or ‘6 as a snack', which will need to be cleaned. Additionally, the high_traffic column only has one unique value, ‘High'; presumably the missing values are all instances where the traffic was low, so these will need to be filled with a ‘Low' value.
Step 3: Data Cleaning
With an assessment of the data structure completed, msno can now be used to visualise the missing data. The missing values in the high_traffic column appear to be structurally missing, therefore it's reasonable to fill them using the fillna() function (specifying ‘Low' for any missing values). For the missing values in the nutritional columns, the dataframe is first sorted by calories and then visualised using the msno.matrix() function.
#fill missing values in high_traffic column with 'Low'
df['high_traffic'] = df['high_traffic'].fillna('Low')
#sort df by 'calories' column
df_sorted = df.sort_values('calories')
#create matrix of missing values
msno.matrix(df_sorted)
plt.show()

I investigate this further by checking whether the missing values are common across only the nutritional columns.
#using msno to visualise only missing values
msno.bar(df[df.isna().any(axis=1)])
plt.show()

The missing data appears to be common across all nutritional columns whenever there is missingness; i.e. when the calories data is missing, the carbohydrate, sugar, and protein data is also missing. Given the small number of data points in the dataset (947 rows), it wouldn't be a good strategy to drop these rows (which represent over 5% of the data points). A more intelligent strategy will need to be applied for this missing data.
The initial data cleaning tasks will be addressed first (changing the servings data type, calculating the per-serving nutritional values, etc.), then the missing data will be appropriately filled. The first function cleans the servings column by taking the first character of each value and converting it to an integer, thereby stripping the ‘as a snack' string from the affected rows.
def keep_first_character(value):
    return int(str(value)[0])
#apply function to servings column
df['servings'] = df['servings'].apply(keep_first_character)
print(df['servings'].unique()) #print unique values to verify
Next, a function is defined to calculate the per-serving values of each of the nutritional columns (‘calories', ‘protein', ‘sugar', and ‘carbohydrate'). The function creates a new column for each nutritional value divided by the servings column. The nutritional column names are captured in a list and passed to the function, so it can be applied to all of them in one call.
#define function to calculate per serving values
def calculate_per_serving(df, columns):
    #looping through columns
    for column in columns:
        df[f'{column}_per_serving'] = df[column] / df['servings']
#defining nutritional columns
nutritional_columns = ['calories', 'protein', 'sugar', 'carbohydrate']
#calculate per serving values
calculate_per_serving(df, nutritional_columns)
With the serving data cleaned and the per-serving nutritional columns created, it's now possible to apply reasonable values to the missing data in the dataset.
I have previously established that only values in the nutritional columns are missing, and the other columns (i.e. servings, category) are complete for these rows. Therefore it's possible to fill the missing data with the median per-serving nutritional value for each category, as can be seen below. (Initially I applied mean values to the missing data, however through trial and error it was found that the median led to better metrics in the final model.)
First this is applied to the per-serving nutritional columns, then scaled up for the total nutritional columns by multiplying the serving size by the per-serving value.
#apply median per_serving value to the missing values
for category in df['category'].unique():
    for column in nutritional_columns:
        fill_value = df.loc[df['category'] == category, f'{column}_per_serving'].median()
        print(f"Median {column} for {category}: {(fill_value).round(2)}") #values rounded for display
        df.loc[(df['category'] == category) & (df[f'{column}_per_serving'].isnull()), f'{column}_per_serving'] = fill_value
#multiply per serving column by servings to fill missing values
for column in nutritional_columns:
    df[column] = df[f'{column}_per_serving'] * df['servings']

With the data now sufficiently cleaned, univariate and multivariate analysis can be conducted to understand its distributions.
Step 4: Analysis & Visualisation
Beginning the analysis, I investigate the split of high traffic across different categories – this is done by creating a countplot per category, and plotting the results using matplotlib. The x-axis labels are rotated 45 degrees for better legibility.
#plotting countplots for category
plt.figure(figsize=(10, 6))
sns.countplot(x='category', data=df, order=df['category'].value_counts().index)
plt.title('Category Counts')
plt.xticks(rotation=45)
plt.show()

The countplot of categories can be further broken down between high traffic and low traffic recipes, to show the variation in popular recipes between different categories.
Additionally, a Seaborn heatmap is plotted using crosstab, for an annotated comparison of high and low traffic recipes within each category.
#plotting countplots for category vs high_traffic
plt.figure(figsize=(10, 6))
sns.countplot(x='category', hue='high_traffic', data=df, palette=['green', 'red'], order=df['category'].value_counts().index)
plt.title('Category vs High Traffic')
plt.xticks(rotation=45)
plt.show()
#plotting crosstab of category and high_traffic
cat_traffic = pd.crosstab(df['category'], df['high_traffic'])
plt.figure(figsize=(10, 6))
sns.heatmap(cat_traffic, annot=True, cmap='Oranges', cbar=False)
plt.title('Category vs High Traffic')
plt.show()


The Beverages category has the most recipes in the dataset at 106 recipes, while the Chicken Breast category has the lowest at 71. The Breakfast category appears to be the most popular among visitors to the site, with the majority of the recipes in that category seeing high traffic. Similarly it's clear that the Meat and Potato categories are also very popular, with these recipes showing high traffic to the site. By contrast, the One Dish Meal category has the highest number of low traffic recipes, followed by Beverages and Lunch/Snacks.
Step 5: Machine Learning Preprocessing
Beginning the preprocessing for the classification models, the per-serving columns are dropped from the dataset. A copy of the dataframe is then created for the preprocessing steps.
#dropping unnecessary columns
drop_cols = ['calories_per_serving', 'protein_per_serving', 'sugar_per_serving', 'carbohydrate_per_serving']
df = df.drop(drop_cols,axis=1)
#creating copy of df for ml
df_ml = df.copy()
Before any transformations are applied, it's beneficial to show histograms of the nutritional columns, to indicate how the initial data is distributed.
#plotting histograms of nutritional columns
df_ml[nutritional_columns].hist(figsize=(10, 10))
plt.show()

The histograms show the nutritional values are all heavily right skewed; many machine learning models perform poorly on features with this kind of skew, so these columns will need to be transformed before any model training is conducted.
The next step involves applying a log transformation to the nutritional columns. Within a loop, each column undergoes the transformation using NumPy's log1p() function, which takes the natural logarithm of one plus the value so that zeros are handled gracefully. It's worth noting that while this is a common method of scaling skewed data, there are no zero values in the nutritional columns, so applying a plain log function would also be fine. Additionally, each transformed column is renamed to reflect the applied transformation using the rename() function.
#applying log transformation to nutritional columns
for column in nutritional_columns:
    df_ml[column] = np.log1p(df_ml[column])
    #rename column
    df_ml.rename(columns={column: f'log1p_{column}'}, inplace=True)
Once this transformation is complete, histograms are once again called to check the new distribution of the transformed data.
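A minimal sketch of that check, using the renamed log1p_ columns, could look like this.
#plotting histograms of the log-transformed nutritional columns
log_columns = [f'log1p_{column}' for column in nutritional_columns]
df_ml[log_columns].hist(figsize=(10, 10))
plt.show()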

The calories and carbohydrate columns appear to have benefitted considerably from the transformation, now showing as approximately normally distributed under the log1p transformation. Some benefit can also be seen in the sugar and protein columns, which are now closer to normal distribution. With appropriate scaling applied to the numerical features, the categorical features must now be encoded before loading to the chosen classification models.

Following the transformations applied to the numerical columns, a list of categorical columns to encode is defined, containing ‘category' and ‘servings'. These columns contain qualitative information that needs to be converted into numerical representations for certain classification algorithms to process effectively. It's important to note that the ‘high_traffic' column (i.e. the target column) is encoded separately, as it only has two distinct values and it was necessary to ensure that ‘High' is mapped to the positive class for the precision metric. The servings column is treated as a category here, as trial and error showed that keeping servings as a feature in the dataset has a positive impact on some of the classification model metrics.
Following this, a LabelEncoder object is instantiated; label encoding is a method used to convert categorical data into numerical labels, assigning a unique integer to each distinct category within a column. This transformation allows the classification model to interpret categorical data. As the code loops, the LabelEncoder is applied to each categorical column in the list. For each iteration, the LabelEncoder's fit_transform() method is used to encode the categorical values of the respective column, replacing them with their corresponding numerical labels.
#replace high_traffic column with 1 and 0
df_ml['high_traffic'] = df_ml['high_traffic'].replace({'High': 1, 'Low': 0})
#instantiate LabelEncoder
label_encoder = LabelEncoder()
#list of categorical columns to encode
cat_columns = ['category', 'servings']
#iterate through each column in the list
for column in cat_columns:
    #apply LabelEncoder to encode the column
    df_ml[column] = label_encoder.fit_transform(df_ml[column])
Finally before moving on to the model training step, the recipe column is dropped from the dataset.
#drop recipe column
drop_cols = ['recipe']
df_ml = df_ml.drop(drop_cols,axis=1)
Step 6: Model Selection & Evaluation
With the data now sufficiently preprocessed, I proceed with building and training the classification models. Two models are selected for training, to compare precision results against each other and the random choice baseline – a LogisticRegression classifier and a RandomForest classifier. Logistic Regression is chosen because it works well with small datasets and has fast runtimes. Random Forest is a versatile ensemble algorithm that is relatively resistant to overfitting, making it a good alternative model choice for this problem.
The first model tested on the dataset is a LogisticRegression classifier. The model training process begins by creating a copy of the dataframe df_ml for classification purposes, named df_model. The target variable is extracted from df_model, and stored as y. Similarly the target variable is removed from df_model, and the resulting DataFrame is stored in X.

I split the data into training and testing sets using train_test_split(). The split is stratified based on the ‘category' column of df_model, with a test size of 20% and a fixed random state of 42. The train-test sets are stratified in this way to keep the category proportions consistent with the original dataset, as this was found to slightly boost model metrics. A Logistic Regression model is instantiated and a parameter grid for the model is defined, encompassing various hyperparameters such as regularization strength, penalty, solver, and maximum iterations, among others. This is done to give the cross validation process a greater chance of finding the best hyperparameters for the model.
I then instantiate a Randomized Search Cross Validation (RandomizedSearchCV) to efficiently explore the hyperparameter space; randomized search is subsequently executed to find the best hyperparameters, maximising precision using 5 fold cross validation. Following this, the best precision score and corresponding best hyperparameters found during the search are printed.
#creating copy of the dataframe for classification
df_model = df_ml.copy()
#extract target variable
y = df_model['high_traffic']
#drop target variable from the dataframe
X = df_model.drop(['high_traffic'], axis=1)
#split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=df_model['category'], test_size=0.2, random_state=42)
#instantiate Logistic Regression model
logistic_regression = LogisticRegression()
#define parameter grid for Logistic Regression
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'max_iter': [100, 200, 300, 400, 500],
    'fit_intercept': [True, False],
    'class_weight': [None, 'balanced'],
    'warm_start': [True, False],
    'multi_class': ['auto', 'ovr', 'multinomial'],
    'random_state': [None, 42]
}
#printing model eval callout
print("Evaluating Logistic Regression:")
#instantiate randomised search CV
random_search_lr = RandomizedSearchCV(logistic_regression, param_distributions=param_grid, n_iter=20,
scoring='precision', n_jobs=-1, cv=5, random_state=42)
#fit randomised search CV
random_search_lr.fit(X_train, y_train)
#best model
best_model_lr = random_search_lr.best_estimator_
#best precision score
best_precision_lr = random_search_lr.best_score_
#best hyperparameters
best_params_lr = random_search_lr.best_params_
print("nBest Precision:", best_precision_lr)
print("Best Parameters:", best_params_lr)

It's evident that the Logistic Regression model performs very well on the data, with a best precision of 81.48% using the best parameters listed above; this is above the goal precision metric of 80% as requested by the product team.
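Since precision_score and accuracy_score were imported earlier, a quick sanity check of the tuned model on the held-out test set could look like the sketch below; this is illustrative only, and the scores reported in this article are the cross-validated values.
#optional check of the tuned model on the held-out test set (illustrative)
y_pred_lr = best_model_lr.predict(X_test)
print("Test precision:", precision_score(y_test, y_pred_lr))
print("Test accuracy:", accuracy_score(y_test, y_pred_lr))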

Now this can be compared against the alternative model: a Random Forest is subsequently trained on the dataset to view its performance. As before, the model training process begins by creating a copy of the dataframe df_ml, stored as df_model. The target variable, ‘high_traffic', is extracted from df_model and removed from the features in X. The data is then split into training and testing sets, using the same stratification methodology as with the Logistic Regression model.
I initialise a Random Forest classifier and define a parameter grid, with hyperparameters including the number of estimators, maximum depth of trees, minimum samples for splitting, and minimum samples for leaf nodes, among others. RandomizedSearchCV is then instantiated to run through the hyperparameters in the parameter grid. As with the Logistic Regression model, a randomised search is conducted to find the best hyperparameters to maximise precision using 5 fold cross validation. Subsequently the best precision score and corresponding best hyperparameters found during the search are printed.
#creating copy of the dataframe for classification
df_model = df_ml.copy()
#extract target variable
y = df_model['high_traffic']
#drop target variable from the dataframe
X = df_model.drop(['high_traffic'], axis=1)
#split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=df_model['category'], test_size=0.2, random_state=42)
#instantiate random forest classifier
rf_classifier = RandomForestClassifier()
#define parameter grid for random forest
param_grid = {
    'n_estimators': [50, 100, 200, 300, 400],
    'max_depth': [3, 4, 5, 6, 7],
    'min_samples_split': [2, 3, 4, 5, 6],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy']
}
#printing model eval callout
print("Evaluating Random Forest:")
#instantiate randomised search CV
random_search_rf = RandomizedSearchCV(rf_classifier, param_distributions=param_grid, n_iter=20,
scoring='precision', n_jobs=-1, cv=5, random_state=42)
#fit randomised search CV
random_search_rf.fit(X_train, y_train)
#best model
best_model_rf = random_search_rf.best_estimator_
#best precision score
best_precision_rf = random_search_rf.best_score_
#best hyperparameters
best_params_rf = random_search_rf.best_params_
#printing model metrics
print("nBest Precision:", best_precision_rf)
print("Best Parameters:", best_params_rf)

As can be seen, the Random Forest did not perform as well as the Logistic Regression model, generating a best precision of 78.82%; this falls marginally short of the 80% goal requested by the product team. Now that both models are trained and optimised with randomised search cross validation, it's possible to compare their performance against the expected results of a "random choice" recommendation.

The random choice probability of recommending a high traffic recipe is defined as the proportion of high traffic recipes in the dataset – if a recipe were selected at random for display on the website, this would be the likelihood that it generated high traffic. I compare the precision metrics of each model against the random choice probability, to put these results in the appropriate business context.
#calculating chance of randomly picking high traffic recipe
random_choice = df[df['high_traffic'] == 'High'].count()['high_traffic'] / len(df['high_traffic'])
print(f"Possibility of choosing high traffic recipe at random: {(100*random_choice).round(2)}%")
#displaying final metrics for logistic regression
print("nLogistic Regression - Metrics:")
print(f"Final precision: {(100*best_precision_lr).round(3)}%")
print(f"Percentage improvement over random choice: {(100*(best_precision_lr/random_choice)-100).round(3)}%")
#displaying final metrics for random forest
print("nRandom Forest - Metrics:")
print(f"Final precision: {(100*best_precision_rf).round(3)}%")
print(f"Percentage improvement over random choice: {(100*(best_precision_rf/random_choice)-100).round(3)}%")

As can be seen above, both models perform considerably better than the random choice baseline, with the Logistic Regression model 34.43% better and the Random Forest model 30.04% better than random choice. Evidently Logistic Regression is the model of choice, having demonstrated a final precision of 81.48% in line with the request of the product team.
Step 7: Conclusions & Recommendations
From an analytical perspective, the Breakfast category appears to be the most popular among visitors to the site, with the majority of the recipes in that category seeing high traffic. Similarly it's clear that the Meat and Potato categories are also very popular, with these recipes showing high traffic to the site. As such these categories should be the priority, if manually choosing recipes for the website. By contrast, the One Dish Meal category has the highest number of low traffic recipes, followed by Beverages and Lunch/Snacks. Therefore these categories should be deprioritised, if manually choosing recipes for the website.
Two classification models were fitted to the data, a RandomForest classifier and a LogisticRegression classifier. The Logistic Regression model performed better than the Random Forest model, generating a final precision of 81.48% compared to 78.82%. A comparison of the Logistic Regression model against random choice shows the final model is 34.43% better, indicating this is a considerably better approach than current business practices. Similarly the Random Forest model also outperformed random choice, with a 30.04% improvement. Therefore the Logistic Regression model above is the preferred model, as it satisfies the requirement of the product team to recommend high traffic recipes at least 80% of the time. As such this is the model the Data Science team will recommend for use by the product team.

In terms of the classification model, I would recommend that the product team gather more recipes to allow for greater fine tuning of the model. The final iteration of the classification model resulted in a precision score of 81.48%, which is in line with the product team's request from the data science team. With further recipe data, the model may be trained to recommend recipes with even higher precision than has already been demonstrated. I noted during data analysis that the Beverages category is the most common in the dataset, however this category is also one of the most unpopular; therefore the product team should strive to increase the number of recipes in other categories, to allow for a larger and more diverse selection of recipes on which to train the model.
Similarly I would advise the product team to add more data on other features, to supplement the data already given in the dataset. For example, features such as estimated recipe preparation time, number of ingredients, recipe difficulty etc. may help in increasing model precision even further than the current model precision, leading to better recommendations for the website.
Final Thoughts
In summary, the process of constructing a Machine Learning model for the above cooking website's recipe selection system illustrates the intricate nature of this kind of workflow. From initial data preprocessing to model selection and evaluation, the process demands a blend of creativity, analytical rigour, and (especially) patience. The work behind constructing this type of system is highly iterative, with many revisions to be expected throughout the training process. Ultimately, with enough perseverance and ingenuity, this process bears fruit with a well designed and high performing machine learning model.