How to Use Machine Learning to Inform Design Decisions and Make Predictions

Author:Murphy | View: 20575 | Time: 2025-03-23 11:54:56

Applying Data Science methods and models to a business use case represents the ultimate goal of most data science work. But crossing the gap between data science theory and application is challenging, requiring the data scientist to understand a business domain, unique data associated with that domain, and the needs and requirements of a customer.

This article provides an approach to applying data science methods, such as machine learning, to a notional business use case. Follow this article to learn how to:

Receive a Business Scenario and Data.
Conduct Data Exploration.
Apply a Machine Learning Classification Model.
Make Predictions and Recommendations Based on the Model.

The Scenario:

You work for an automotive company as a data scientist. The company has a reputation for making sporty and quick cars, and is developing a new car for the 1983 model year. The design team has several new drivetrain configurations to choose from, each of which has implications for performance and fuel economy.

New car designs receive testing from environmental test agencies, who evaluate a car's fuel economy in Miles Per Gallon (MPG). Cars that consume excess amounts of fuel receive a "Gas Guzzler Tax," which is a consumption tax. This designation has a negative impact as the tax is passed on to the car's purchaser, making the car less desirable versus alternatives that do not have the tax.

Your company leadership has issued a new, explicit requirement that all future designs avoid the Gas Guzzler Tax designation.

While the US Environmental Protection Agency has its own regulations for a Gas Guzzler Tax [1], for the scenario in this article, cars qualify as a "Gas Guzzler" if they fail to achieve a target MPG of 21 or higher during environmental agency testing.

As the company's data scientist, it is your job to help the design team make informed decisions about their drivetrain configuration before they spend the time and money building different prototypes for real world testing. Fortunately, the company has historical data from its own cars as well as those from competitors to help you provide insights to the design team.

Code and Data:

The code is available at the linked GitHub page. The data used is the Auto MPG dataset, available from the UC Irvine Machine Learning Repository under a Creative Commons 4.0 License [2]. The data is accessible within Jupyter via the following code:

import pandas as pd
from ucimlrepo import fetch_ucirepo

# Fetch dataset:
auto_mpg = fetch_ucirepo(id=9)

# Data (as pandas dataframes):
X = auto_mpg.data.features
y = auto_mpg.data.targets

# Make into pandas dataframe:
mpg = pd.concat([X, y], axis=1)

# Fill NaNs with mean:
mpg = mpg.fillna(mpg.mean())

mpg.head()

The dataframe has 398 rows, 8 columns, and looks like this:

This article uses the following Python libraries:

# Data handling:
import pandas as pd

# Data visualization:
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

1. Data Preparation and Exploratory Data Analysis

Let's begin with some data preparation and data exploration. The requirement is to avoid the Gas Guzzler Tax, which kicks in for cars scoring less than 21 MPG during testing. This sets up a classification modeling problem, where the company is interested in whether or not a design proposal has an unfavorable classification as a Gas Guzzler.

Let's take the mpg column and change it to a binary 0 or 1, with 0 attributed to all cars performing at or above 21 MPG and 1 to all cars tagged with the Gas Guzzler Tax. This will allow us to quickly see which cars fail to meet the requirement. The following code prepares the data:

# Add the "Gas Guzzler" column based on the criteria:
mpg['mpg'] = mpg['mpg'].apply(lambda x: 1 if x < 21 else 0)
mpg = mpg.rename(columns={'mpg': 'gas_guzzler_tax'})
mpg.head()

The dataframe now looks like this, with the mpg column renamed gas_guzzler_tax:

Running the value_counts() function on the Gas Guzzler Tax column reveals 227 cars avoid the tax while 171 receive it.
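
The labeling and tallying step can be sketched on a tiny sample. The MPG values below are hypothetical and are not rows from the Auto MPG dataset; only the rule (below 21 MPG is a Gas Guzzler) comes from the scenario.

```python
import pandas as pd

# Hypothetical MPG readings, NOT rows from the Auto MPG dataset:
sample = pd.DataFrame({'mpg': [18.0, 21.0, 20.9, 33.5, 15.0, 26.0]})

# Same rule as the scenario: below 21 MPG is a Gas Guzzler (1), otherwise 0.
sample['gas_guzzler_tax'] = (sample['mpg'] < 21).astype(int)

# Tally how many cars fall in each class:
counts = sample['gas_guzzler_tax'].value_counts()
print(counts)
```

Note that a car testing at exactly 21 MPG avoids the tax under this rule, since the comparison is strictly less than 21.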

For exploratory data analysis, let's begin with a correlation matrix. The following code generates one:

# Plot a correlation matrix:
plt.figure(figsize=(8, 6))
sns.heatmap(mpg.corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix')

plt.show()

The correlation matrix visualizes relationships between various features in the data. For example, engine displacement has a strong, positive correlation with engine cylinders. This means the data tends to move in the same direction; that is, engines with higher displacement also tend to have more cylinders.

Notice, though, that acceleration is negatively correlated with horsepower, as seen by the dark blue box at the intersection of the two features. The acceleration column records the time to reach 60 miles per hour, so cars with more horsepower tend to post shorter times.

Recall that the Gas Guzzler Tax is coded as a 1 or 0, with the 1 representing cars with the Gas Guzzler Tax. The correlation matrix shows four features (displacement, cylinders, horsepower, and weight) positively correlate with getting flagged as a Gas Guzzler.

Also note that model_year and receipt of the gas_guzzler_tax are negatively correlated, implying newer cars are less likely to receive the consumption tax. This may reflect car design changing over time to favor efficiency, or newer technologies improving vehicle efficiency.

We can start to see there are conflicts, or tradeoffs, within the features; while adding engine power seems to correlate with the likelihood of being a Gas Guzzler, it also has some desirable impacts such as improved acceleration performance.
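
To read the target's row of the correlation matrix as a ranked list rather than a heatmap, one option is to pull that single column out of corr() and sort it. The sketch below uses a small made-up dataframe named toy (an assumption for illustration); with the article's data, the same call pattern would apply to the mpg dataframe.

```python
import pandas as pd

# Hypothetical toy values, chosen only to illustrate the call pattern:
toy = pd.DataFrame({
    'displacement':    [350, 302, 250, 140, 98, 90],
    'weight':          [4100, 3900, 3300, 2600, 2100, 2000],
    'acceleration':    [11.0, 12.0, 15.0, 16.5, 17.0, 19.0],
    'gas_guzzler_tax': [1, 1, 1, 0, 0, 0],
})

# Correlation of every feature with the target, strongest positive first:
target_corr = (toy.corr()['gas_guzzler_tax']
               .drop('gas_guzzler_tax')
               .sort_values(ascending=False))
print(target_corr)
```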

A seaborn pairplot can provide a few more insights [3]. Let's look strictly at the positively correlated features: displacement, cylinders, horsepower, weight, and gas_guzzler_tax.

# Create a pairplot:
sns.pairplot(mpg[['displacement', 'cylinders', 'horsepower',
             'weight', 'gas_guzzler_tax']], hue='gas_guzzler_tax')

In the above, we see that vehicles which meet our goal of no Gas Guzzler Tax (shown in blue for a value of 0) tend to produce less horsepower, have smaller engines, and weigh less. They also tend to have four-cylinder engines.

From a simple visual exploration of the data, several initial insights emerge for the design team: weight, engine displacement, horsepower, and number of cylinders are all positively correlated with receipt of the Gas Guzzler Tax; cars which avoid the tax tend to have smaller values for those four features. But, higher values for those features can have positive benefits for consumers in terms of driving performance.

2. Applying Machine Learning Techniques

The structure of the problem and data lends itself to the supervised machine learning technique known as classification modeling. There are various machine learning algorithms available for classification modeling, and this linked Towards Data Science article provides a rundown of some available choices [4].

Let's try running a random forest classification model on the data to see if we can learn more about the data and how the various features impact whether or not a car gets classified as a Gas Guzzler. The following code sets up a Random Forest Classifier:

# Separate the target variable:
y = mpg['gas_guzzler_tax']
X = mpg.drop(columns='gas_guzzler_tax')

# Split into train and test:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train the model:
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

The above code separates the target variable, gas_guzzler_tax, from the dataframe and uses scikit-learn's train_test_split() to hold out 20% of the rows for testing [5]. The scikit-learn RandomForestClassifier() model is instantiated and then fit to the training data [6].

The below code takes the fit model and tests it via the predict() function, then produces the accuracy score of the model [7].

# Test the model:
y_pred = rf_model.predict(X_test)

# Return the accuracy score:
rfAccuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", rfAccuracy)

For this model, the initial accuracy is 93.75%. We can further assess performance of a classification model via techniques such as the confusion matrix and classification report:
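
A single train/test split yields one accuracy number that depends on which rows land in the test set. One way to check how stable that number is, sketched here as an addition rather than as part of the original workflow, is k-fold cross-validation with scikit-learn's cross_val_score(). The data below is synthetic (generated with make_classification as a stand-in); X_demo and y_demo are hypothetical names, and the article's X and y could be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; swap in X and y from the article):
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=42)

model = RandomForestClassifier(random_state=42)

# Five-fold cross-validation: each fold serves once as the held-out set.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

A wide spread across folds would suggest the single-split accuracy should be quoted with some caution.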

# Print classification report:
print('Classification Report:\n', classification_report(y_test, y_pred))

# Generate the plot:
plt.figure(figsize=(8, 6))

sns.heatmap(confusion_matrix(y_test, y_pred),
            annot=True, cbar=False, fmt='d',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])

plt.title('Confusion Matrix')

plt.show()

The result is:

The confusion matrix helps assess model performance. While your final product delivery as a data scientist may not include details such as model accuracies and confusion matrices, it is an important step to ensure credibility and soundness of the model used to deliver insights.

As seen from the precision, recall, and accuracy scores and the confusion matrix, the model performs well at identifying true negatives (Predicted 0, actual 0) and identifying true positives (predicted 1, actual 1). For a deep dive into the confusion matrix, reference this linked Towards Data Science article [8].
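
To make the link between the confusion matrix and the report's scores concrete, the arithmetic can be worked by hand. The labels below are hypothetical (not the article's test set); the only library behavior relied on is that, for binary labels, confusion_matrix(...).ravel() returns the counts in the order tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration (not the article's test set):
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp:
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision: of the cars flagged as Gas Guzzlers, how many really were?
precision = tp / (tp + fp)
# Recall: of the real Gas Guzzlers, how many did the model flag?
recall = tp / (tp + fn)
# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

In this scenario a false negative (a real Gas Guzzler predicted as 0) is arguably the costlier error, so recall for class 1 deserves particular attention.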

2.1. Feature Importances

Having built and fit a random forest model, we can now extract the feature importances. Doing so is straightforward using the feature_importances_ attribute of the fitted scikit-learn model [9]:

# Get the feature importances:
rf_feature_importance = rf_model.feature_importances_
rf_feature_importance = pd.DataFrame(
    {'Feature': X.columns, 'Importance': rf_feature_importance})
rf_feature_importance = rf_feature_importance.sort_values(
    by='Importance', ascending=False)

This gives a table that the following code visualizes:

# Plot feature importance:

# Set plot size:
plt.figure(figsize=(10, 6))

# Feature importance:
sns.barplot(data=rf_feature_importance,
            x='Importance', y='Feature', palette='mako')

# Set labels:
plt.title('Feature Importance - Random Forest', fontsize=18, y=1.03)

plt.show()

The plot looks like this:

So what does this mean? Feature importances help explain which features, or columns, have the biggest impact on the model's ability to predict the target variable (gas_guzzler_tax).

In this case, displacement of the engine is the most impactful feature in our model's ability to predict if a car qualifies as a Gas Guzzler. Features like model_year and acceleration are less important to our Random Forest Classifier's ability to predict if a vehicle is or is not a Gas Guzzler.

From our exploratory data analysis, we found via a correlation matrix that displacement, cylinders, weight, and horsepower all had strong positive correlations with being classified as a Gas Guzzler. In our Random Forest Classifier, those four features are the most important in the model's ability to predict if a car is a Gas Guzzler or not. This bolsters some of those initial insights on what influences a car's ultimate status regarding the consumption tax.

Note that unlike the correlation matrix, feature importances do not tell us directionality of a feature's influence on the prediction (example: what happens to the prediction if engine displacement increases versus decreases?). Model explainer tools like SHAP may provide further insight into how the features impact the specific prediction; read more at this Towards Data Science article on SHAP values [10].
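
A lighter-weight way to get a rough sense of direction, without SHAP, is a one-at-a-time sensitivity sweep: hold the other features fixed, vary one feature across its range, and watch the predicted probability. This is not SHAP and is not the article's method; it is a minimal sketch on synthetic data in which larger engines are deliberately labeled as Gas Guzzlers so the sweep has a known answer. The names demo, labels, and sweep are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not Auto MPG): bigger engines
# are labeled Gas Guzzlers so the sweep has a known answer.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'displacement': rng.uniform(70, 450, 300),
    'acceleration': rng.uniform(8, 25, 300),
})
labels = (demo['displacement'] > 200).astype(int)

model = RandomForestClassifier(random_state=42).fit(demo, labels)

# Hold acceleration at its median and sweep displacement across its range:
sweep = pd.DataFrame({
    'displacement': np.linspace(70, 450, 5),
    'acceleration': demo['acceleration'].median(),
})

# Column 1 of predict_proba is the probability of class 1 (Gas Guzzler):
sweep['p_gas_guzzler'] = model.predict_proba(sweep)[:, 1]
print(sweep)
```

A sweep like this ignores interactions between features, which is one reason dedicated explainer tools exist.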

2.2. Predictions

While you are exploring the data and building your model, the design team brings their proposals to you. They present four options for the 1983 model year with the intent of releasing a car with a sporty driving experience in line with the company's image. Given the company's experience, the design team is well versed in mathematically estimating engine horsepower and acceleration expectations from design options and transmission gearing. However, the design must meet the company's new requirement of avoiding the Gas Guzzler consumption tax, an unknown for the design team. The design proposals are:

# Create car design dataframe:
car_designs = pd.DataFrame({
    'displacement': [305, 240, 240, 180],
    'cylinders': [8, 6, 6, 4],
    'horsepower': [225, 200, 190, 170],
    'weight': [3600, 3440, 3350, 3250],
    'acceleration': [7.1, 7.3, 7.6, 7.8],
    'model_year': [83, 83, 83, 83],
    'origin': [1, 1, 1, 1]
})

car_designs

Based on the exploratory data analysis and the trained random forest, we can start to make some assumptions about which car designs are more likely to receive the Gas Guzzler Tax. But instead of assuming, let's make use of that trained Random Forest Classifier to run some predictions:

# Make predictions on the new data points, with columns in training order:
predictions = rf_model.predict(car_designs[X.columns])

print(f'Predicted Class: {predictions}')

The output is:

The model predicts the first two designs (rows) in the car_designs dataframe will receive the Gas Guzzler Tax, while the last two will not.

Recall that the car company has a reputation for building sporty, higher performance cars. The design team really wants to hit the acceleration goal of 7.3 seconds or better, so they propose re-working the second design with advanced materials that slightly reduce weight. This will also marginally improve the acceleration estimate. Fortunately, the Random Forest model can take user-created inputs and predict whether or not the vehicle will be a Gas Guzzler through the following code:

# Get user input for each feature:
displacement = int(input("Enter displacement: "))
cylinders = int(input("Enter cylinders: "))
horsepower = int(input("Enter horsepower: "))
weight = int(input("Enter weight: "))
acceleration = float(input("Enter acceleration: "))
model_year = int(input("Enter model year: "))
origin = int(input("Enter origin: "))

# Create new data point for prediction:
car_design = pd.DataFrame({
    'displacement': [displacement],
    'cylinders': [cylinders],
    'horsepower': [horsepower],
    'weight': [weight],
    'acceleration': [acceleration],
    'model_year': [model_year],
    'origin': [origin]
})

# Make a prediction:
prediction = rf_model.predict(car_design)

print(f'Prediction: {prediction}')

We now see that the second row design, with a lower weight, meets the guidance of avoiding the Gas Guzzler Tax while also delivering strong acceleration performance.
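
The interactive input() prompts work in a notebook, but a small helper function is easier to reuse and to test. The sketch below is one possible shape for such a helper, not the article's code: predict_design is a hypothetical name, and the model is trained on synthetic two-column data only so the example runs on its own. The helper checks that every training feature is supplied and reorders the columns to match the order used when fitting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def predict_design(model, feature_names, **specs):
    """Return the predicted class for one design given as keyword specs."""
    missing = set(feature_names) - set(specs)
    if missing:
        raise ValueError(f"Missing features: {sorted(missing)}")
    # Build a one-row frame with columns in the order used during training:
    row = pd.DataFrame([specs])[list(feature_names)]
    return int(model.predict(row)[0])

# Synthetic stand-in training data (an assumption, not Auto MPG):
rng = np.random.default_rng(0)
train = pd.DataFrame({
    'weight': rng.uniform(1600, 5200, 200),
    'horsepower': rng.uniform(45, 230, 200),
})
target = (train['weight'] > 3400).astype(int)
demo_model = RandomForestClassifier(random_state=0).fit(train, target)

# Keyword order does not matter; the helper reorders the columns:
heavy = predict_design(demo_model, train.columns, horsepower=190, weight=5000)
light = predict_design(demo_model, train.columns, horsepower=190, weight=1800)
print(heavy, light)
```

With the article's model, the equivalent call would pass rf_model and X.columns along with all seven feature values.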

3. Analysis Results and Next Steps

The following insights are now available to the design team, backed by data analysis and machine learning modeling:

Engine displacement, engine cylinders, vehicle weight, and horsepower positively correlate with a vehicle being a Gas Guzzler.
Increasing the values for some or all of those four features increases likelihood of the vehicle being a Gas Guzzler.
Machine Learning reinforces the importance of those four features, as they were the most important features influencing model predictions of whether or not a vehicle is a Gas Guzzler.
Of the four candidate designs, Machine Learning modeling predicts two will fail to meet the company's requirements.
Modifying the second design to have a more aggressive weight target results in the model predicting the car will meet the company's requirements.

These are good insights for the design team, but how can we improve the Data Science approach?

Feature Selection:

Note that for this example, we ran all of the data features through the model. However, feature importances and the correlation matrix indicate we might have some opportunities to reduce the feature space.

For example, origin and model_year are both low in importance to the Random Forest Classifier. One could choose to leave origin because it may imply different geographical design considerations and regulatory requirements, thus having an influence on the design. But the company location is also beyond control of the design team.

The model_year feature is similar in that it is beyond control. However, newer model years could imply more modern technologies and options or changing regulations and consumer tastes. For example, consider how vehicle weight has changed as a function of year, showing that model year has implications on feature data for a car:

# Set plot size:
plt.figure(figsize=(10, 6))

# Create box plot:
sns.boxplot(data=mpg, x="model_year",
            y="weight", palette='mako')

# Set labels:
plt.title('Year versus Weight',
          fontsize=18, y=1.03)
plt.xlabel('Year', fontsize=13)
plt.ylabel('Weight', fontsize=13)

plt.show()

Regardless of the decision on which features to keep, the analysis underpinning the decision to include or discard a feature must be defensible and methodologically credible.
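
One defensible way to back a keep-or-drop decision is to retrain on the reduced feature set and compare cross-validated accuracy against the full set. The sketch below illustrates that comparison on synthetic data with one informative column and one pure-noise column; frame, label, and mean_cv_accuracy are hypothetical names, and the column choices are assumptions rather than a recommendation for the Auto MPG features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption): one informative column, one noise.
rng = np.random.default_rng(1)
frame = pd.DataFrame({
    'weight': rng.uniform(1600, 5200, 300),
    'noise': rng.normal(size=300),
})
label = (frame['weight'] > 3400).astype(int)

def mean_cv_accuracy(columns):
    """Mean 5-fold accuracy of a default random forest on the given columns."""
    model = RandomForestClassifier(random_state=1)
    return cross_val_score(model, frame[columns], label, cv=5).mean()

full = mean_cv_accuracy(['weight', 'noise'])
reduced = mean_cv_accuracy(['weight'])
print(f"All features: {full:.3f} | Reduced: {reduced:.3f}")
```

If the reduced model scores about the same, dropping the feature costs little predictive power; domain reasons for keeping it may still apply.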

Additional Features and Data:

There are numerous other features impacting the decision-space not included in the dataset, mainly because the dataset, which is a toy or experimentation dataset for learning, does not include them. But in a real-world scenario, the design team could calculate or provide other data features such as Coefficient of Drag estimates or estimated Vehicle Frontal Area. These both impact fuel efficiency via aerodynamics as a vehicle moves through the air.

For a business, pricing info is also important. Recall for our scenario that the second design option failed until receiving a revision with a lower weight. Perhaps this weight loss came via more exotic materials that increase price. While the new design meets the core Gas Guzzler avoidance requirement, is there a price threshold requirement?

Finally, remember that the mpg dataset from the UCI repository is a low-dimensional dataset meant primarily for learning. Ideally, real-world data would have higher dimensionality and more observations.

Alternate and Advanced Modeling Techniques:

This example only used a Random Forest Classifier, but there are other machine learning algorithms up for the task of classification. Multiple models could be used in an ensemble approach to gain deeper understanding of features.

The problem could be approached with a Random Forest Regressor or other Regression modeling techniques, predicting continuous values of MPG for a design versus approaching the problem as a binary 1 or 0 classification model.
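
As a rough sketch of the regression framing, a RandomForestRegressor can predict MPG directly and the 21 MPG threshold can be applied afterward. The data here is synthetic, generated from an assumed linear relationship between weight and MPG plus noise; cars, mpg_values, and designs are hypothetical names and values, not the article's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption, not Auto MPG):
rng = np.random.default_rng(7)
cars = pd.DataFrame({'weight': rng.uniform(1600, 5200, 300)})
# Assumed toy relationship: heavier cars get fewer miles per gallon.
mpg_values = 46 - 0.007 * cars['weight'] + rng.normal(0, 1.0, 300)

reg = RandomForestRegressor(random_state=7).fit(cars, mpg_values)

# Two hypothetical designs, one light and one heavy:
designs = pd.DataFrame({'weight': [2200, 4800]})
designs['predicted_mpg'] = reg.predict(designs[['weight']])

# Apply the scenario's threshold to the continuous prediction:
designs['gas_guzzler_tax'] = (designs['predicted_mpg'] < 21).astype(int)
print(designs)
```

The regression output also shows how far a design sits from the threshold, which a bare 0 or 1 label does not.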

This example used a Random Forest Classifier with no hyperparameter tuning or adjustments for imbalance. In the example data, the target variable had 227 instances of 0 versus 171 of 1, representing a mild imbalance. Bigger datasets may have even greater imbalance in the target variable, requiring techniques to address this. Additionally, with bigger real-world datasets, hyperparameter tuning may be necessary after analyzing model performance.
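
Both ideas can be sketched together with scikit-learn: class_weight='balanced' reweights classes inversely to their frequency, and GridSearchCV searches a small hyperparameter grid with cross-validation. The data is synthetic with an assumed 80/20 class split, and the grid values are arbitrary illustrations, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, deliberately imbalanced stand-in data (an assumption):
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4,
    weights=[0.8, 0.2], random_state=42)

# Small illustrative grid; the values are arbitrary, not recommendations.
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, None]}

# class_weight='balanced' upweights the rarer class during training;
# F1 is used for scoring because accuracy can mislead on imbalanced data.
search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    param_grid, cv=3, scoring='f1')
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```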

Production Pipelines and Applications:

This article presented a Jupyter Notebook-based example. A complete solution for a business use case might include a production pipeline on an analytics platform with a scheduler that automatically pulls new data in from a database or user inputs, runs the trained model on the new data, and provides predictions to the design team via a web-based interface.
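
One small building block of such a pipeline is persisting the trained model so a separate scoring job can load it without retraining. A minimal sketch using joblib follows; the synthetic data, the temporary directory, and the file name rf_model.joblib are assumptions for illustration, and a real deployment would also need versioning and input validation.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not Auto MPG):
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, random_state=42)
trained = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Training job: save the fitted model to disk (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), 'rf_model.joblib')
joblib.dump(trained, model_path)

# Scoring job: load the model later and predict on incoming rows.
restored = joblib.load(model_path)
same = (restored.predict(X_demo) == trained.predict(X_demo)).all()
print("Restored model matches original:", same)
```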

4. Conclusion

The goal of this article was to showcase an example of taking a notional business problem and applying data science techniques to provide insights. Through the application of exploratory data analysis and machine learning, descriptive and even predictive analytics become a possibility for aiding business processes. Feel free to download, use, and modify the code available at the linked GitHub page – thank you for reading!