Outlier Detection Using Random Forest Regressors: Leveraging Algorithm Strengths to Your Advantage


Using a model's robustness to outliers to detect them

Photo by Will Myers on Unsplash

Problem Statement

The problem of outlier detection can be tricky, especially if the ground truth or the definition of what counts as an outlier is ambiguous or based on multiple factors. Mathematically speaking, an outlier can be defined as a data point more than three standard deviations away from the mean. However, in most real-life problems, not all data points far from the mean are of the same significance; sometimes we need a bit more nuance when flagging outliers.

Image by Rohanukhade

Let's take a quick example:

We have a dataset of water consumption per household. By analyzing the water consumption as a whole and isolating points more than 3 standard deviations from the mean, we can quickly flag the outliers that use the most water.

This, however, fails to take into account the reason behind the increased consumption: there can be multiple reasons why water consumption is high, and some are of more interest than others. For example, the abnormal consumption could be due to an unusually large house with many facilities; in that case, the reason for the increased water consumption is not hidden and is easily explained, so it is not of interest here. Let us build further on our example and say we want to study households whose water consumption makes no sense even after considering the household parameters. These data points are arguably more worthy of investigation, since we do not know the source of the anomalous overconsumption. A responsible municipality would investigate such needless overconsumption for leaks or inefficiencies.

The need then arises for a way to detect outliers relative to comparable subsets of the data, not relative to the dataset as a whole. There are many ways to achieve this; one simple way is to plot 2D or 3D scatter plots between our target variable and one or two features of the dataset. If the plot shows points that sit away from the majority, we have found outliers in the target variable while taking the features of our dataset into consideration.

Image by Author

Depending on our goals and our case-by-case definition of an outlier, we might want to draw that boundary differently. In this case, since we want outliers to be households that consume more water than their context would normally call for, we focus on points below and away from the y = x line. Intuitively, we want our model to shape itself to that line, instead of applying a cutoff at a fixed water-consumption threshold.

Generating these 2D scatter plots for each variable is certainly an option, but we can also use the strengths of some ML algorithms to our advantage. What if we used a machine learning model that is insensitive to outliers to decide whether a point is an outlier with respect to our entire feature set?

The idea:

By utilizing an algorithm that's designed to prevent overfitting, such as Random Forest Regression, the predictions for outlier data points can exhibit high residual errors. This is because algorithms like Random Forests prioritize generalizing well to the majority of the data, rather than fitting exactly to each data point. In the case of outliers, which significantly deviate from the typical data patterns, the algorithm tends to produce predictions that are notably different from the ground truth values. Random Forests, for instance, often return predictions as the average of target values for similar data points, which can be seen as an attempt to estimate what the target value might be if that data point were not an outlier. This means we can use the residual error of a random forest regression to detect outliers.
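A minimal sketch of that idea on synthetic data (scikit-learn; the house-size/consumption numbers are made up purely for illustration, not taken from the article's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in: consumption mostly explained by house size,
# plus a handful of injected households that overconsume for no reason.
house_size = rng.uniform(50, 400, size=1000)
consumption = 2.5 * house_size + rng.normal(0, 20, size=1000)
consumption[:10] += 500  # unexplained overconsumption

X, y = house_size.reshape(-1, 1), consumption

# min_samples_leaf keeps the forest from memorizing single points,
# so an outlier is predicted as the average of its "similar" neighbours.
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=20,
                              random_state=0).fit(X, y)
residuals = y - model.predict(X)

# Flag points whose residual is more than 3 standard deviations
# above the mean residual.
outliers = residuals > residuals.mean() + 3 * residuals.std()
print(f"flagged {outliers.sum()} of {len(y)} points")
```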

Background

Decision Trees

Decision trees are a fundamental machine learning model used for both classification and regression tasks. They function by recursively partitioning the data into subsets based on feature attributes, ultimately forming a tree-like structure of decision rules. At each internal node, a decision is made based on a feature that splits the data optimally, and the tree branches into child nodes corresponding to different feature values. This process continues until a stopping criterion is met, such as a maximum tree depth or a minimum number of data points per leaf. Decision trees are praised for their high prediction speed, which remains largely independent of the dataset's size, as well as for their ability to form complex non-linear boundaries that fit non-linear data quite well.

Image by Author: an illustration of a decision tree's ability to generate complex non-linear decision boundaries, which are common in real-world scenarios.
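A quick sketch of that non-linear fitting ability on toy data (scikit-learn, illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A noisy sine wave is a simple non-linear target a straight line cannot fit.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=300)

# Even a shallow tree carves the curve into piecewise-constant steps.
tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(tree.predict([[1.5], [3.0], [4.5]]))  # roughly follows the sine shape
```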

Decision trees are interpretable, making them valuable for understanding feature importance and data patterns. A lack of interpretability is a major problem for most machine learning models and prevents them from being deployed in many sensitive fields where a model's prediction must be explained and interpreted before any further action is taken. One such field is fraud detection: the decision to flag a transaction as fraud can be a costly one that causes serious inconvenience to all parties involved. Hence, a claim that a transaction was fraudulent must be highly explainable, which is why decision trees and tree-based algorithms are so popular in fraud detection.

You may now be asking: why don't we just use decision trees to detect outliers? Well, unlike the random forest algorithm, the single decision tree used as a building block of a random forest is extremely prone to overfitting. This is because of the way optimal splits are found. In the case of regression, an optimal split is one that reduces the Root Mean Squared Error (RMSE) of the predictions. Left alone, a decision tree might keep splitting until there is one data point in each leaf, driving the training RMSE to zero. This is why pruning a tree or setting a maximum depth is crucial when tuning tree algorithms.

Pruning a tree refers to cutting back branches until a satisfactory balance between low RMSE and tree size is reached; this is usually done by tuning a hyperparameter (often named alpha) that acts as a penalty term in the cost function, penalizing larger trees. Even after pruning, decision trees still have limitations. One is that the algorithm may ignore certain features entirely because other features keep producing better splits, leaving us with a tree that does not consider all the features in the dataset; in our case, that is exactly what we want to avoid. Another drawback of single decision trees comes from the greedy nature of the algorithm: a decision tree picks the optimal split at each node without considering decisions that might lead to better partitioning in the long run. This means the root of a tree has a strong influence over the rest of the partitioning process and may prevent us from reaching the best overall partition.
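To make both points concrete, here is a small sketch in which scikit-learn's `ccp_alpha` plays the role of the pruning penalty described above (toy data, illustrative alpha value):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 3))
y = X[:, 0] ** 2 + rng.normal(0, 5, size=500)

# Unconstrained tree: keeps splitting until each leaf holds one point,
# driving the training RMSE to (near) zero.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
print("training RMSE, unpruned:",
      mean_squared_error(y, full_tree.predict(X)) ** 0.5)

# Cost-complexity pruning: a larger alpha penalizes larger trees,
# trading a bit of training error for a much smaller, more general tree.
pruned_tree = DecisionTreeRegressor(ccp_alpha=2.0, random_state=0).fit(X, y)
print("training RMSE, pruned:  ",
      mean_squared_error(y, pruned_tree.predict(X)) ** 0.5)
print("leaves:", full_tree.get_n_leaves(), "vs", pruned_tree.get_n_leaves())
```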

Random Forest Regressors

Random forest regressors are ensemble learning techniques that combine many weak learners, specifically shallow decision trees (which, individually, do not overfit). The algorithm overcomes the shortcomings of a single decision tree by generating many such trees (publications suggest 64–128) over different subsets of the data. Each tree is built on a subset of the data points, drawn via bootstrapping (sampling with replacement), and considers only a subset of the features. This means a given tree may see duplicate data points and only part of the feature set, which adds to the diversity of the trees generated and helps further with generalization.
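In scikit-learn, this row and feature subsampling maps onto a handful of `RandomForestRegressor` parameters. A brief sketch with illustrative values (note that scikit-learn draws the feature subset per split rather than once per tree):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=500)

forest = RandomForestRegressor(
    n_estimators=100,   # within the 64-128 range cited above
    bootstrap=True,     # each tree gets a bootstrap sample of the rows
    max_samples=0.8,    # fraction of rows drawn (with replacement) per tree
    max_features=0.5,   # fraction of features considered at each split
    random_state=0,
).fit(X, y)
```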

Image by Author
Image by Author illustrating Bootstrapping columns to train a single tree

"Random forests does not overfit. You can run as many trees as you want. It is fast." – Random Forests, Leo Breiman and Adele Cutler.


Let's put our code where our mouth is

We will use the Chicago Multi-Family Residential Building and Census 2017 (CMRBC 2017) dataset, which was collected and analyzed in a study by Seyrfar et al. (2021).
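A sketch of the workflow described above might look like the following. The file name and the `water_consumption` column are placeholders rather than the dataset's actual schema, and the features are assumed to already be numeric or encoded:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Placeholder file/column names -- substitute the actual CMRBC 2017 schema.
df = pd.read_csv("cmrbc_2017.csv")
target = "water_consumption"
features = [c for c in df.columns if c != target]

X, y = df[features], df[target]

# min_samples_leaf keeps leaves from memorizing single households, so an
# anomalous household is compared against similar households in its leaf.
model = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=20, random_state=0
)
model.fit(X, y)

# Residual = actual minus predicted consumption for "similar" households.
residuals = y - model.predict(X)

# Flag households whose residual sits more than 3 standard deviations
# above the mean residual, i.e. unexplained overconsumption.
threshold = residuals.mean() + 3 * residuals.std()
outliers = df[residuals > threshold]
print(f"{len(outliers)} households flagged for unexplained overconsumption")
```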


Summary

Using random forest regression, we were able to set up an outlier-insensitive regressor to predict water consumption in our dataset. We isolated our outliers by selecting the rows for which the regressor's prediction was off by more than 3 standard deviations from the mean of the model's residual errors. Detecting outliers using the residuals of our model allowed us to isolate households that consume more water than the average of similar households that land in the same leaves of our trees.

We were able to go from a threshold that looks like this:

Image by Author

To something that is more aware of the context and looks like this:

Image by Author

In conclusion, we were able to differentiate anomalous water consumption in relation to other household parameters. This differentiation allows us to exclude data points where the higher consumption is justified and to flag households whose consumption, relative to the property's characteristics, is anomalous. This understanding can lead to more efficient resource allocation and targeted interventions, ultimately benefiting both consumers and resource management initiatives.

Honorable Mentions

There are other ways to detect anomalies using tree-like algorithms, most notably Isolation Forests. In a nutshell, Isolation Forests calculate how anomalous a data point is by assessing how quickly it can be isolated from the rest of the dataset. They operate on the basis that anomalies will be separated and isolated early on at shallower levels of the tree. Therefore, a data point that is isolated at a shallower level of the tree receives a higher anomaly score.
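For reference, a minimal scikit-learn sketch of that approach (synthetic data, illustrative parameter values):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0, scale=1, size=(990, 2))
anomalies = rng.uniform(low=6, high=9, size=(10, 2))
X = np.vstack([normal, anomalies])

# Points that become isolated after only a few random splits are labeled -1.
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = iso.fit_predict(X)        # 1 = inlier, -1 = anomaly
scores = -iso.score_samples(X)     # higher = more anomalous
print("flagged:", (labels == -1).sum())
```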

For those interested in alternative approaches to Anomaly Detection, Isolation Forests offer a compelling method worth exploring further.

Like this article?

Consider supporting me through Buy Me a Coffee: buymeacoffee.com/zakharymg

Tags: Anomaly Detection, Artificial Intelligence, Data Science, Machine Learning, Random Forest Regressor
