How Bias and Variance Affect Your Model

Author: Murphy | Time: 2025-03-22 19:13:28

Introduction

Ever since I started migrating to Data Science I heard about the famous Bias versus Variance tradeoff.

But I learned just enough to move on with my studies and never looked back. I always knew that a highly biased model underfits the data, while a high-variance model overfits it, and that neither is good when training an ML model.

I also knew that we should look for a balance between both states, so we end up with a good fit: a model that generalizes the pattern well to new data.

But I must say I never went further than that. I never searched for or created highly biased or high-variance models just to see what they actually do to the data and what their predictions look like.

That is, until today, because this is exactly what we're doing in this post. Let's proceed with some definitions.

High Bias

Oversimplifying = Hit anything with the hammer | Google Gemini, 2024. https://gemini.google.com

Bias is the difference between the average prediction of our model and the true value that we are trying to predict. A highly biased model underfits the data, failing to capture underlying patterns.

Models with High Bias perform poorly on both training and test data.

Model Behavior: The model oversimplifies and misses the correct relationship.
Analogy: Think about this problem like a student who oversimplifies topics, skips important details, and gets consistently low scores everywhere.
Example: Misses key information, like predicting house prices using only the number of bedrooms and ignoring location.
Example of Model Prediction: If the true pattern is to add 1 for every unit increase, a high bias model might always add 0.5, or worse, not add anything at all, completely ignoring the input.

Let's see an example with code.

# Imports
import numpy as np
import matplotlib.pyplot as plt

# Generating nonlinear data (noise-free sine curve)
x = np.linspace(0, 10, 50)
y = 2 * np.sin(x)

# Plot the data against a hand-picked straight line
plt.scatter(x, y, label="True Nonlinear Data")
plt.plot(x, 0.5 * x, color='g', label="High Bias Linear Model")
plt.legend()
plt.show()

This is what a highly biased model looks like. We are trying to fit a linear regression to non-linear data. We're oversimplifying the pattern, trying to make a line fit to a curve.

High Bias model: trying to fit a curve with a line

If we go one step ahead, create a Linear Regression model, and then make some predictions, we will see what happens.

from sklearn.linear_model import LinearRegression
import pandas as pd

# Fit
lm = LinearRegression().fit(x.reshape(-1, 1), y)
# Predict
preds = lm.predict(x.reshape(-1, 1))

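# Performance: percentage difference between actual and predicted values
# (the first value of y is exactly 0, so its percentage is not finite)
(pd
 .DataFrame({'Actual': y, 
             'Predicted': preds, 
             'Difference %': 100*(y-preds)/y})
 .head(10)
 .T
 )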

Look how the predictions are way off. The model can't recognize the pattern in this data.
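To put a number on the claim that a high-bias model does poorly on both training and test data, here is a minimal sketch. It reuses the same sine data, and the 80/20 split and the seed are my own assumptions, chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same noise-free sine data as in the example above
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * np.sin(x).ravel()

# Assumption: an 80/20 split with a fixed seed, used only for illustration
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

lm = LinearRegression().fit(x_train, y_train)

# Mean squared error on each split
train_mse = mean_squared_error(y_train, lm.predict(x_train))
test_mse = mean_squared_error(y_test, lm.predict(x_test))

print(f"Train MSE:     {train_mse:.3f}")
print(f"Test MSE:      {test_mse:.3f}")
print(f"Variance of y: {y.var():.3f}")
```

If both errors come out close to the variance of y itself, the line is doing little better than always predicting the mean, which is what underfitting looks like in numbers.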

Now it is time to understand models with high variance.

High Variance

Unpredictable outcomes based on minor fluctuations in the data | Google Gemini, 2024.

Variance is the variability of the model's predictions for different training datasets. It measures how much the predictions change if the training data changes. A high-variance model overfits the training data, capturing noise instead of general patterns.

A model with high variance performs well on training data but poorly on unseen data.
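The definition above can be checked directly: refit the same model on many noisy versions of the training data and measure how much its predictions move. This is only a sketch under my own assumptions (100 repetitions, noise with standard deviation 2, and a helper I named `prediction_variance`), comparing a straight line with an unconstrained decision tree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y_true = 2 * np.sin(x).ravel()
x_grid = np.linspace(0, 10, 50).reshape(-1, 1)  # fixed points where we compare predictions

def prediction_variance(make_model, n_repeats=100):
    """Hypothetical helper: refit on fresh noisy targets each time and
    return the variance of the predictions, averaged over x_grid."""
    preds = np.empty((n_repeats, len(x_grid)))
    for i in range(n_repeats):
        y_noisy = y_true + rng.normal(0, 2, len(x))  # a new training set
        preds[i] = make_model().fit(x, y_noisy).predict(x_grid)
    return preds.var(axis=0).mean()

var_linear = prediction_variance(LinearRegression)
var_tree = prediction_variance(lambda: DecisionTreeRegressor(random_state=0))

print(f"Linear regression prediction variance: {var_linear:.3f}")
print(f"Unconstrained tree prediction variance: {var_tree:.3f}")
```

The tree's predictions should swing far more from one training set to the next, because each leaf simply repeats the noisy value it saw.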

Model Behavior: The model is too complex and too sensitive to the training data, so it overfits.
Analogy: Like a student who memorizes answers without understanding concepts, performing great in practice but poorly in real-world tests.
Example: Captures too much noise, like trying to predict the stock market but mistaking random fluctuations for trends.
Example of Model Prediction: If the true pattern is to add 1 for every unit increase, a high variance model might learn to add 100, or even unpredictable values based on minor fluctuations in the data.

Let's code an example now.

# Imports
import numpy as np
import matplotlib.pyplot as plt

# Generating noisy data around a sine curve
x = np.linspace(0, 10, 50)
y = 2 * np.sin(x) + np.random.normal(0, 2, len(x))  # Adding noise

# Plot
plt.scatter(x, y, label="High Variance Data")
plt.plot(x, 2 * np.sin(x), color='r', label="True Pattern")
plt.legend()
plt.show()

The data here is very noisy. Notice how it fluctuates significantly around the true curve, making it challenging for a model to learn the correct relationship without overfitting.

Noisy data fluctuating significantly around the true curve.

Let's create a Decision Tree model, which is very well-known for overfitting, and assess the behavior of a high-variance model.

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Generate high variance data
np.random.seed(42)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y_true = 2 * np.sin(x).ravel()
y_high_variance = y_true + np.random.normal(0, 3, len(x))  # Adding noise

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y_high_variance, 
                                                    test_size=0.2,
                                                    random_state=42)

# Fit a decision tree regressor (high variance model due to overfitting)
tree_high_variance = DecisionTreeRegressor(max_depth=None, random_state=42)  # No depth limit increases variance
tree_high_variance.fit(x_train, y_train)

# Predict using the model
x_range = np.linspace(0, 10, 300).reshape(-1, 1)  # Fine range for smooth predictions
y_pred = tree_high_variance.predict(x_range)

# Plot the data and model predictions
plt.figure(figsize=(10, 6))
plt.scatter(x_train, y_train, color='blue', label='Training Data', alpha=0.7)
plt.scatter(x_test, y_test, color='green', label='Test Data', alpha=0.7, marker='x')
plt.plot(x_range, y_pred, color='orange', label='High Variance Model Predictions', linewidth=2)
plt.plot(x, y_true, color='red', label='True Pattern', linewidth=2, linestyle='--')
plt.title("High Variance Model with Decision Tree")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()

The outcome is displayed next.

High-variance model: overfitted to the training data; poor job on test data.

We see in the previous picture that the model is overfitted to the training data, but it does not do well on the test data. Let's see the numbers.

# Predictions Train set
train_preds = tree_high_variance.predict(x_train)

(pd
 .DataFrame({'Actual': y_train,
             'Predicted': train_preds, 
             'Difference %': 100*(y_train-train_preds)/y_train})
 .head(10)
 .T
)

# Predictions test set
preds = tree_high_variance.predict(x_test)

(pd
 .DataFrame({'Actual': y_test,
             'Predicted': preds, 
             'Difference %': 100*(y_test-preds)/y_test})
 .head(10)
 .T
)

As expected, the model is completely overfitted to the train set: it reproduced every training value exactly.

But it couldn't generalize the pattern, performing poorly on the test set.
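A single error metric per split makes the gap easier to see than a table of individual rows. The sketch below rebuilds the same data and tree as above and compares the mean squared error on each split. MSE is my choice of metric here, not something the tables above use.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same data-generating steps as in the decision tree example
np.random.seed(42)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y_true = 2 * np.sin(x).ravel()
y_high_variance = y_true + np.random.normal(0, 3, len(x))

x_train, x_test, y_train, y_test = train_test_split(
    x, y_high_variance, test_size=0.2, random_state=42)

tree = DecisionTreeRegressor(max_depth=None, random_state=42).fit(x_train, y_train)

# Assumption: MSE as the summary metric for both splits
train_mse = mean_squared_error(y_train, tree.predict(x_train))
test_mse = mean_squared_error(y_test, tree.predict(x_test))

print(f"Train MSE: {train_mse:.3f}")
print(f"Test MSE:  {test_mse:.3f}")
```

A training error of zero next to a large test error is the typical signature of a high-variance model.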

Tradeoff

To solve those problems, the idea is to find the middle ground between a model that is too simple and another that is too complex.

Increasing model complexity reduces bias, giving a closer fit to the training data, but increases variance, making the model more sensitive to noise. On the other hand, simplifying the model reduces variance, making it less sensitive to noise, but increases bias, possibly missing true patterns.
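One way to watch this tradeoff happen is to sweep a complexity knob and track both errors. The sketch below uses the tree's `max_depth` as that knob on the same noisy sine data. The list of depths is an arbitrary choice of mine.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2 * np.sin(x).ravel() + np.random.normal(0, 3, len(x))

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

# Assumption: these depths are illustrative; None means no depth limit
depths = [1, 2, 4, 8, None]
train_errors, test_errors = [], []

for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(x_train, y_train)
    train_errors.append(mean_squared_error(y_train, tree.predict(x_train)))
    test_errors.append(mean_squared_error(y_test, tree.predict(x_test)))
    print(f"max_depth={depth}: train MSE={train_errors[-1]:.2f}, "
          f"test MSE={test_errors[-1]:.2f}")
```

The training error can only go down as the tree gets deeper, while the test error is the one to watch: the depth where it is lowest is the balance point for this particular split.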

Analogy

Think of bias as a rigid system (like a straight ruler) and variance as a flexible one (a bendable ruler).

A rigid ruler won't capture curves (high bias).
A bendable ruler may bend too much for minor imperfections, overcomplicating the shape (high variance).

The ideal tool is one that balances rigidity and flexibility to represent the true shape (patterns in data).

In Practice

In practical terms, what can be done to balance this tradeoff?

High Bias Models

Use more complex algorithms
Cross Validation
Feature engineering
Add non-linear terms
Fine-tune hyperparameters
Use Boosting ensemble methods (XGB, Gradient Boosting), because they reduce bias by correcting errors iteratively.
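As an illustration of one item on this list, adding non-linear terms, here is a sketch that gives the linear regression polynomial features on the sine data from the High Bias section. The degree of 7 and the scaling step are my assumptions, picked so the polynomial can follow the curve without numerical trouble.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Noise-free sine data, as in the High Bias example
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * np.sin(x).ravel()

# Plain straight line (high bias)
linear = LinearRegression().fit(x, y)

# Same linear regression, but fed polynomial terms of x
# Assumption: degree=7 is an illustrative choice, not a tuned value
poly = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=7, include_bias=False),
    LinearRegression(),
).fit(x, y)

linear_mse = mean_squared_error(y, linear.predict(x))
poly_mse = mean_squared_error(y, poly.predict(x))

print(f"Straight line MSE:       {linear_mse:.4f}")
print(f"Degree-7 polynomial MSE: {poly_mse:.4f}")
```

The algorithm is still a linear regression; only the features changed, and that alone is enough to remove most of the bias on this data.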

High Variance Models

Simplify the model
Cross Validation
Add regularization
Use Bagging ensemble methods (Random Forest), because they reduce variance by averaging predictions.
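Two of these remedies, simplifying the model and bagging, can be sketched on noisy sine data. The setup below is my own assumption: 300 random training points, 200 trees in the forest, and errors measured against the true noise-free curve so that only the model's mistakes are counted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Assumption: 300 noisy training points drawn around the sine curve
rng = np.random.default_rng(42)
x_train = rng.uniform(0, 10, 300).reshape(-1, 1)
y_train = 2 * np.sin(x_train).ravel() + rng.normal(0, 3, 300)

# Evaluate against the true pattern on a fixed grid
x_grid = np.linspace(0, 10, 200).reshape(-1, 1)
y_grid_true = 2 * np.sin(x_grid).ravel()

models = {
    "Deep tree": DecisionTreeRegressor(max_depth=None, random_state=42),
    "Shallow tree": DecisionTreeRegressor(max_depth=3, random_state=42),
    "Random forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

errors = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    errors[name] = mean_squared_error(y_grid_true, model.predict(x_grid))
    print(f"{name}: MSE vs. true pattern = {errors[name]:.2f}")
```

With a single input feature, the random forest here is effectively plain bagging of deep trees, so any improvement over the single deep tree comes from averaging.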

Before You Go

A conceptual article like this helps us build a deeper understanding of this important topic in data science. Knowing what the Bias vs. Variance Tradeoff is and how each type of error affects modeling is valuable knowledge that enables us to take corrective action.

Every modeling decision – like choosing algorithms, regularization, and hyperparameter tuning – implicitly aims to balance bias and variance to minimize total error. This balance is crucial for building robust and generalizable Machine Learning models.