Custom Scoring Functions in Scikit-Learn


RandomizedSearchCV, GridSearchCV, and cross_val_score are all tools to optimize and evaluate Machine Learning models in scikit-learn. Each of these tools offers a systematic approach to hyper-parameter tuning and model performance assessment.

For a long time, I used these tools out of the box without any consideration for the scoring function. However, I eventually learned that when using these tools, scikit-learn defaults to the model's inherent scoring function to evaluate performance. The default scoring metric isn't always appropriate, which can lead to misinformed decisions regarding the model.

The remainder of this article will delve into how and when to utilize custom scoring functions in scikit-learn.

Example: Tweedie Regression on Insurance Claims

In this example, we develop a regressor for predicting future insurance claim costs, a task complicated by the inherent uncertainty in insurance data. That uncertainty stems from a couple of sources.

  • Once someone has purchased an insurance policy, there is no guarantee they will ever file a claim. This leads to a high concentration of zeroes in the target.
  • If someone does file a claim, the size of that claim could be large or small. This leads to a large variance in our target variable.

By default, RandomizedSearchCV, GridSearchCV, and cross_val_score use the default scoring metric associated with the classifier or regressor passed to them. For most widely used regressors, the default score is R². However, using a metric like R² or RMSE to make decisions about parameter tuning or to evaluate model performance on insurance data will often lead to incorrect decisions and results.

Consequently, when working with insurance data we need to ensure we pass an appropriate scoring function to these tools to ensure we correctly set the model's parameters and evaluate the model's performance. More generally, any time you use these tools you should ensure that the scoring function being used is appropriate for your data. In some cases, you might just want to use a scoring function different than the default.
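
As a quick illustration of that default behavior, here is a minimal sketch (using a hypothetical RandomForestRegressor on synthetic data, not the model from this post) showing that when no scoring argument is supplied, cross_val_score falls back to the estimator's own .score method, which is R² for scikit-learn regressors.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)

# no scoring argument: falls back to model.score, which is R² for regressors
default_scores = cross_val_score(model, X, y, cv=3)

# explicitly requesting R² should produce the same numbers
explicit_r2_scores = cross_val_score(model, X, y, cv=3, scoring="r2")

print(default_scores)
print(explicit_r2_scores)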

In the next few subsections, we will build a Tweedie regressor using LightGBM, RandomizedSearchCV, and a custom scoring function. This example is intended to demonstrate how to use scoring functions in tools like RandomizedSearchCV, GridSearchCV, or cross_val_score. It is not intended to provide a detailed overview of machine learning model development or hyper-parameter tuning, nor to produce a good model.

Simulating our Dataset

Due to the sensitive nature of insurance data, it's often hard to find publicly available claims history datasets. For that reason, we simulate our own dataset using the code provided below. Each row represents a policyholder, and for each policyholder we simulate the number of claims they have incurred as well as the size of those claims. Simulating the dataset ensures the data has a few defining characteristics.

  • policy_exposure, or how long the person has held the policy, is uniformly distributed between 1 and 12 months.
  • num_claims, or how many claims someone has filed, is Poisson distributed. This ensures a right-skewed distribution, where the majority of people haven't filed any claims.
  • claim_cost, or the total amount of an individual's claims, is Gamma distributed. The claim cost is zero if the number of claims is zero.

import pandas as pd
import numpy as np

np.random.seed(42)

n_samples = 50000 

policy_exposure = np.random.uniform(1, 12, n_samples)

n_features = 10
X = np.random.randn(n_samples, n_features)

annual_claim_rate = 0.05  # expected number of claims per policy per year
monthly_claim_rate = annual_claim_rate / 12  # expected claims per policy per month
num_claims = np.random.poisson(lam=monthly_claim_rate * policy_exposure)

claim_cost = np.zeros(n_samples)
positive_cost_indices = num_claims > 0
high_variance_gamma = np.random.gamma(2, 3000, size=positive_cost_indices.sum())
claim_cost[positive_cost_indices] = high_variance_gamma

claims_df = pd.DataFrame(
    {
        "policy_exposure": policy_exposure,
        "num_claims": num_claims,
        "claim_cost": claim_cost,
    }
)

features_df = pd.DataFrame(X, columns=[f"feature_{i+1}" for i in range(n_features)])
df = pd.concat([claims_df, features_df], axis=1)

# add increasing amounts of additive noise to the first five features
num_claims_std_dev = 2
for i in range(1, 6):
    df[f"feature_{i}"] += np.random.normal(0, num_claims_std_dev * i, n_samples)

# scale the last five features by increasingly noisy multipliers
claim_cost_std_dev = 2
for i in range(6, 11):
    df[f"feature_{i}"] *= np.random.normal(1, claim_cost_std_dev * (i - 5), n_samples)

Building our Regressor

The first thing we will do is import our LightGBM ("LGBM") regressor and set the regression objective to tweedie. Note that the objective used to train the model is independent of the metric used to evaluate it.

from lightgbm import LGBMRegressor

tweedie_regressor = LGBMRegressor(objective="tweedie", verbose=0)

features = [x for x in df.columns if "feature" in x]

target = "claim_cost"

weight = "policy_exposure"

The Tweedie distribution is a type of statistical distribution that encompasses a range of other well-known distributions, including the Normal, Poisson, and Gamma distributions. The Tweedie distribution's power parameter determines the specific form of the distribution within the Tweedie family. A power parameter between 1 and 2 leads to a compound Poisson-Gamma distribution.
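
For quick reference, here is how common values of the power parameter map to familiar distributions (the variance of a Tweedie-distributed variable is proportional to its mean raised to this power):

# Tweedie power parameter -> distribution family
# (variance is proportional to mean ** power)
tweedie_power_families = {
    "power = 0": "Normal",
    "power = 1": "Poisson",
    "1 < power < 2": "Compound Poisson-Gamma (point mass at zero, skewed positive tail)",
    "power = 2": "Gamma",
    "power = 3": "Inverse Gaussian",
}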

In our case, since we simulated our data, we already know that the number of claims follows a Poisson distribution and the size of those claims follows a Gamma distribution. For that reason, we will set up a randomized grid search to find the optimal power parameter value between 1 and 2.

RandomizedSearchCV with Default Scoring

When building Tweedie models on insurance data, one common evaluation metric is mean_tweedie_deviance. However, our initial grid search will intentionally use the estimator's default scoring function (R² in this case) to select the optimal parameter. Once the search is complete, we will calculate the mean_tweedie_deviance so that we can compare it to future grid searches designed to optimize mean_tweedie_deviance.

from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "tweedie_variance_power": [x / 100 for x in range(100, 200)],
}

fit_params = {"sample_weight": df[weight]}

random_search = RandomizedSearchCV(
    estimator=tweedie_regressor,
    n_iter=50,
    param_distributions=param_distributions,
    cv=5,
    verbose=2,
    random_state=21,
    n_jobs=-1,
)

random_search.fit(df[features], df[target], **fit_params)

We confirm that the model's score is equal to R².

from sklearn.metrics import r2_score

best_power = random_search.best_params_["tweedie_variance_power"]

best_estimator = random_search.best_estimator_

best_estimator.fit(df[features], df[target], sample_weight=df[weight])

model_score = best_estimator.score(df[features], df[target])

r2 = r2_score(df[target], best_estimator.predict(df[features]))

print(model_score == r2)

# True

We can also view the optimal power parameter. Interpreting mean_tweedie_deviance on its own can be tricky; however, lower is generally better. We will use this value as our baseline to compare to future searches.

from sklearn.metrics import mean_tweedie_deviance

deviance = mean_tweedie_deviance(
    df[target],
    best_estimator.predict(df[features]),
    power=best_power,
    sample_weight=df[weight],
)

print(f"Best Power: {best_power}")
print(f"mean_tweedie_deviance: {deviance}")

# Best Power: 1.29
# mean_tweedie_deviance: 159.49

RandomizedSearchCV using make_scorer

Without additional evaluation, the results using the default scorer look okay. However, let's imagine we're experienced data scientists and we recognize that using R² to tune our model is inappropriate. We decide we want to switch our scoring function to mean_tweedie_deviance.

To make this available to RandomizedSearchCV, we use scikit-learn's make_scorer. From the scikit-learn documentation, make_scorer is a function to "make a scorer from a performance metric or loss function." In other words, this function allows us to use any function available in sklearn.metrics as a scoring function for RandomizedSearchCV, GridSearchCV, or cross_val_score. There are a few other arguments available in the documentation; however, the only one we need is greater_is_better, which tells our search that a lower mean_tweedie_deviance is better.

It's relatively easy to set up a scoring metric using make_scorer.

from sklearn.metrics import make_scorer

# greater_is_better=False tells scikit-learn to negate the metric internally
tweedie_loss = make_scorer(mean_tweedie_deviance, greater_is_better=False)

With our new scoring function, we run our search in the same way.

param_distributions = {
    "tweedie_variance_power": [x / 100 for x in range(100, 200)],
}

fit_params = {"sample_weight": df["policy_exposure"]}

random_search = RandomizedSearchCV(
    estimator=tweedie_regressor,
    n_iter=10,
    param_distributions=param_distributions,
    scoring=tweedie_loss,
    cv=5,
    verbose=2,
    random_state=21,
    n_jobs=-1,
)

random_search.fit(df[features], df["claim_cost"], **fit_params)

Again, we can view the results of our grid search.

best_power = random_search.best_params_["tweedie_variance_power"]

best_estimator = random_search.best_estimator_

best_estimator.fit(df[features], df[target], sample_weight=df[weight])

deviance = mean_tweedie_deviance(
    df[target],
    best_estimator.predict(df[features]),
    power=best_power,
    sample_weight=df[weight],
)

print(f"Best Power: {best_power}")
print(f"mean_tweedie_deviance: {deviance}")

# Best Power: 1.29
# mean_tweedie_deviance: 159.49

The results are the same as those from the default scorer, but why? When we create our custom scoring function using make_scorer, the search doesn't use whatever power it is currently testing in the calculation of mean_tweedie_deviance. Instead, it uses the metric's default power. According to the scikit-learn documentation, the default power for mean_tweedie_deviance is 0, and mean_tweedie_deviance with a power of zero reduces to mean squared error, which leads to the same power selection as the default scoring metric. We can confirm the equivalence below by comparing a weighted MSE to the deviance at power 0.

def weighted_mse(y_true, y_pred, weights):
    weighted_squared_diff = weights * (y_true - y_pred) ** 2

    weighted_mean_squared_error = np.sum(weighted_squared_diff) / np.sum(weights)

    return weighted_mean_squared_error

mse = weighted_mse(
    df[target], best_estimator.predict(df[features]), df[weight]
)

deviance = mean_tweedie_deviance(
    df[target],
    best_estimator.predict(df[features]),
    power=0,
    sample_weight=df[weight],
)

print(f"Weighte MSE: {mse}")
print(f"Deviance (Power = 0): {deviance}")

# Weighted MSE: 1566555.33
# Deviance (Power = 0): 1566555.33

If we were focused on tuning other LGBM parameters and already knew the power we wanted, we could use a scoring function like the one below. However, since we are tuning the power parameter, this is not possible for us.

tweedie_loss = make_scorer(mean_tweedie_deviance, greater_is_better=False, power=1.5)

RandomizedSearchCV using Custom Scoring Object

Instead of using make_scorer, we can write our own scorer and pass it directly as the scoring argument. A plain metric with the signature (y_true, y_pred) needs to be wrapped with make_scorer, as we did above, but any callable with the signature (estimator, X, y) can be passed directly to RandomizedSearchCV, GridSearchCV, or cross_val_score.

For our example, we elect to design our scoring function with the signature (estimator, X, y). Having the estimator available inside the function allows us to access the power currently being evaluated and use it to calculate mean_tweedie_deviance correctly during the search.

def tweedie_loss(estimator, X, y):
    # use the power the search is currently evaluating
    power = estimator.get_params()["tweedie_variance_power"]
    y_pred = estimator.predict(X)
    # negate because scikit-learn treats higher scores as better
    return -mean_tweedie_deviance(y, y_pred, power=power)

Now, we simply pass this function to our search!

param_distributions = {
    "tweedie_variance_power": [x / 100 for x in range(100, 200)],
}

fit_params = {"sample_weight": df["policy_exposure"]}

random_search = RandomizedSearchCV(
    estimator=tweedie_regressor,
    n_iter=10,
    param_distributions=param_distributions,
    scoring=tweedie_loss,
    cv=5,
    verbose=2,
    random_state=21,
    n_jobs=-1,
)

random_search.fit(df[features], df["claim_cost"], **fit_params)

Like before, let's look at the optimal power.

best_power = random_search.best_params_["tweedie_variance_power"]

best_estimator = random_search.best_estimator_

best_estimator.fit(df[features], df[target], sample_weight=df[weight])

deviance = mean_tweedie_deviance(
    df[target],
    best_estimator.predict(df[features]),
    power=best_power,
    sample_weight=df[weight],
)

print(f"Best Power: {best_power}")
print(f"mean_tweedie_deviance: {deviance}")

# Best Power: 1.62
# mean_tweedie_deviance: 35.30

We notice that the power is slightly higher, and the mean_tweedie_deviance is much lower! Since mean_tweedie_deviance is a more appropriate metric for this model, we can feel good about this result.

You may have noticed that when we calculate our mean_tweedie_deviance after the search is complete, we pass sample weights as an input. However, in our custom scoring object, we don't use any sample weights. Passing additional arguments to custom scoring objects is necessary when the required information isn't available in any of the variables in the required signature. Although it's possible to provide such inputs, it requires building your own k-fold validation loop rather than relying on RandomizedSearchCV; a minimal sketch of what that could look like follows.
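
For readers who do need weighted scoring, below is a minimal sketch of a manual, weight-aware cross-validation loop. It assumes the df, features, target, and weight variables defined earlier, and weighted_cv_deviance is a helper name of our own choosing; treat it as a starting point rather than a drop-in replacement for RandomizedSearchCV.

import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_tweedie_deviance
from sklearn.model_selection import KFold

def weighted_cv_deviance(estimator, X, y, weights, power, n_splits=5):
    # manual k-fold loop that applies sample weights when fitting and scoring
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=21)
    fold_scores = []
    for train_idx, test_idx in kf.split(X):
        est = clone(estimator).set_params(tweedie_variance_power=power)
        est.fit(
            X.iloc[train_idx],
            y.iloc[train_idx],
            sample_weight=weights.iloc[train_idx],
        )
        preds = est.predict(X.iloc[test_idx])
        fold_scores.append(
            mean_tweedie_deviance(
                y.iloc[test_idx],
                preds,
                power=power,
                sample_weight=weights.iloc[test_idx],
            )
        )
    return np.mean(fold_scores)

# example: evaluate a single candidate power with weighted deviance
print(weighted_cv_deviance(tweedie_regressor, df[features], df[target], df[weight], power=1.5))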

Conclusion

In this post, we explored the complexities of customizing scoring metrics in RandomizedSearchCV. The concepts covered in this article extend to additional tools like GridSearchCV and cross_val_score. We also discussed a few limitations of this approach. Remember, the default scoring metric might not always be appropriate for the task at hand!

