Robust Statistics for Data Scientists Part 2: Resilient Measures of Relationships Between Variables

Grasping the interconnections among variables is essential for making data-driven decisions. When we accurately evaluate these links, we bolster the trustworthiness and legitimacy of our findings, crucial in both scholarly and practical contexts.
Data scientists frequently turn to Pearson's correlation and linear regression to probe and measure variable relationships. These methods presume data normality, independence, and consistent spread (or homoscedasticity) and perform well when these conditions are met. However, real-world data scenarios are seldom ideal. They're typically marred by noise and outliers, which can skew the results of traditional statistical techniques, leading to incorrect conclusions. This piece, the second in our series on robust statistics, seeks to navigate these obstacles by delving into robust alternatives that promote more dependable insights, even amidst data irregularities.
In case you have missed the first part:
Robust Statistics for Data Scientists Part 1: Resilient Measures of Central Tendency and…
Pitfalls of Traditional Methods
Pearson's correlation is a statistical method designed to capture the extent of linear association between two continuous variables, on a scale that ranges from -1, indicating a perfect inverse linear relationship, to +1, representing a perfect direct linear relationship, with 0 reflecting the absence of any linear relationship. The method assumes that the variables follow a normal distribution and are linearly related. It is noteworthy, however, that Pearson's correlation is very sensitive to outliers, which can significantly skew the estimated coefficient, producing a misleading picture of the relationship's strength or apparent lack thereof.
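To make the definition concrete, here is a minimal hand-rolled version of Pearson's r (an illustrative sketch; in practice you would call scipy.stats.pearsonr). Because every observation enters through products of deviations from the mean, a single extreme point can dominate both the numerator and the denominator:
import numpy as np

def pearson_r(x, y):
    # covariance of x and y divided by the product of their standard deviations
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
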
On the other hand, linear regression seeks to delineate the relationship between a dependent variable and one or more independent variables by constructing a linear equation that closely aligns with the observed data. The coefficients are typically estimated with the least squares method, which is particularly vulnerable to outliers. Such outliers can unduly influence the regression line, affecting the accuracy of the slope and intercept estimates and, consequently, misrepresenting the variables' relationship.
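For the simple univariate case, the least squares estimates can be written in closed form (again an illustrative sketch); since the slope is a ratio of sums over all points, one extreme (x, y) pair can pull the entire fit:
import numpy as np

def ols_fit(x, y):
    # closed-form ordinary least squares for a single predictor
    dx = x - x.mean()
    slope = np.sum(dx * (y - y.mean())) / np.sum(dx**2)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept
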
To offer a more tangible example of these concepts in action, let us revisit a simple dataset, similar to that used in the first part of our discussion on robust methods, which delineates the relationship between study hours and exam scores within a student cohort. Generally, a consistent trend is observed, where increased study duration correlates with enhanced exam performance. Yet, there might exist instances of deviation from this norm, such as a student who, despite studying for unusually long hours, scored disproportionately low due to illness during the exam.
In the simulation that follows, we will explore this scenario in greater detail, initially by examining the distortion an outlier introduces in linear regression analysis, and subsequently, by assessing its repercussion on Pearson's correlation, thereby providing a comprehensive understanding of how such outliers can impact the interpretation of statistical data.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.linear_model import (LinearRegression, HuberRegressor, RANSACRegressor,
TheilSenRegressor)
import statsmodels.api as sm
# Seed for reproducibility
np.random.seed(0)
# Simulating study hours and exam scores
study_hours = np.random.normal(5, 2, 100)
exam_scores = 10 + study_hours * 3 + np.random.normal(0, 5, 100)
# Introducing an outlier
study_hours[98] = 15 # Unusually high study hours
exam_scores[98] = 20 # Disproportionately low score
# Plotting the data; the mask excludes the outlier so it is drawn only once, in red
mask = np.arange(len(study_hours)) != 98
plt.figure(figsize=(10, 6))
plt.scatter(study_hours[mask], exam_scores[mask], color='skyblue', alpha=0.7, label='Data Points', s=50) # Normal data points
plt.scatter(study_hours[98], exam_scores[98], color='red', label='Outlier', edgecolor='black', s=50) # Outlier
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.title('Study Hours vs. Exam Scores with Outlier')
plt.legend()
plt.show()

The plot above highlights an outlier that diverges from the predominantly linear relationship observed between the number of hours dedicated to study and the resulting exam scores. To elucidate the influence exerted by this outlier, we will apply Pearson's correlation to the dataset both inclusive and exclusive of this anomalous data point. This comparative analysis aims to shed light on how a single outlier can skew the perceived strength of the relationship between study hours and exam scores, potentially leading to misleading conclusions about the true nature of this association.
# Remove the outlier for the comparison
study_hours_wo_outlier = np.delete(study_hours, 98)
exam_scores_wo_outlier = np.delete(exam_scores, 98)
# Calculate Pearson's correlation with and without the outlier
corr_coef_with_outlier, _ = pearsonr(study_hours, exam_scores)
corr_coef_wo_outlier, _ = pearsonr(study_hours_wo_outlier, exam_scores_wo_outlier)
# Create a table for comparison
comparison_table = pd.DataFrame({
    "Metric": ["Pearson's Correlation"],
    "With Outlier": [corr_coef_with_outlier],
    "Without Outlier": [corr_coef_wo_outlier]
})
comparison_table

When the outlier is included in the analysis, Pearson's coefficient shows a diminished correlation between study hours and exam scores, not accurately reflecting the stronger relationship evident among the bulk of the data points. This discrepancy highlights the sensitivity of Pearson's correlation to anomalies within the dataset.
Next, we will explore the impact of this outlier on a simple linear regression model. By fitting the model with and without the outlier, we can observe how a single anomalous data point influences the regression line, altering the estimated slope and intercept and thus the interpretation of the relationship between study hours and exam scores. This examination provides insight into how robust linear regression is in the presence of outliers, and what that means for data analysis.
# Reshape data for sklearn
X = study_hours.reshape(-1, 1)
y = exam_scores
# Linear regression with all data points
model_with_outlier = LinearRegression()
model_with_outlier.fit(X, y)
predictions_with_outlier = model_with_outlier.predict(X)
# Remove the outlier for the second model
mask = np.arange(len(X)) != 98 # Create a mask to filter out the outlier
X_without_outlier = X[mask]
y_without_outlier = y[mask]
# Linear regression without the outlier
model_without_outlier = LinearRegression()
model_without_outlier.fit(X_without_outlier, y_without_outlier)
predictions_without_outlier = model_without_outlier.predict(X)
# Plotting the data points
plt.figure(figsize=(10, 6))
plt.scatter(study_hours[mask], exam_scores[mask], color='skyblue', alpha=0.6, label='Data Points', s=50)
plt.scatter(study_hours[98], exam_scores[98], color='red', label='Outlier', edgecolors='black', s=50)
# Plotting regression lines
plt.plot(study_hours, predictions_with_outlier, color='red', label='Regression with Outlier')
plt.plot(study_hours, predictions_without_outlier, color='blue', label='Regression without Outlier', linestyle='-')
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.title('Linear Regression with and without Outlier')
plt.legend()
plt.show()

The plot highlights how outliers can drastically influence linear regression models, causing a single aberrant data point to shift the (red) regression line and potentially distort the interpretation of the core relationship within the dataset.
Robust Correlation Coefficients

To mitigate the impact of outliers in correlation analysis, one can adopt more resilient statistical methods such as Spearman's Rank Correlation and Kendall's Tau.
Spearman's Rank Correlation assesses the degree to which the association between two variables can be modelled by a monotonic function, prioritising the ranking of data points rather than their actual numeric values. Such an approach reduces the undue impact of outliers by focusing on the relative positions of data points, thereby offering a more robust alternative to conventional correlation assessments in datasets where outliers are present.
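One way to see why ranking blunts outliers: Spearman's coefficient is simply Pearson's correlation computed on the ranks of the data, so even the most extreme value is only one rank step away from its neighbour. A quick sketch on illustrative data confirms the equivalence:
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = x + rng.normal(scale=0.5, size=50)

manual, _ = pearsonr(rankdata(x), rankdata(y))  # Pearson applied to ranks
builtin, _ = spearmanr(x, y)
print(f"Pearson on ranks: {manual:.4f}, spearmanr: {builtin:.4f}")  # identical
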
Kendall's Tau, another non-parametric approach, also measures the strength and direction of the association between two variables. It accomplishes this by assessing the discrepancy between the likelihood of two randomly chosen data pairs being in concordant order versus in discordant order, offering a nuanced perspective on the consistency of the relationship between variables, unaffected by the extreme values that might compromise other statistical analyses.
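In its simplest form (tau-a, which ignores ties), the statistic is just the difference between the counts of concordant and discordant pairs divided by the total number of pairs. The sketch below is illustrative only; scipy's kendalltau computes the tie-adjusted tau-b variant, so results can differ slightly when ties are present:
from itertools import combinations
import numpy as np

def kendall_tau_a(x, y):
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
        if s > 0:
            concordant += 1   # pair ordered the same way in x and y
        elif s < 0:
            discordant += 1   # pair ordered oppositely
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs
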
To validate the robustness of these methods in the face of outliers, we will apply both Spearman's Rank Correlation and Kendall's Tau to correlate study hours and exam scores. By comparing the results obtained from these analyses with those derived from Pearson's correlation, we can better appreciate the resilience of these alternative approaches in maintaining their reliability even when outliers are present in the data.
spearman_corr_with_outlier, _ = spearmanr(study_hours, exam_scores)
spearman_corr_wo_outlier, _ = spearmanr(study_hours_wo_outlier, exam_scores_wo_outlier)
kendall_corr_with_outlier, _ = kendalltau(study_hours, exam_scores)
kendall_corr_wo_outlier, _ = kendalltau(study_hours_wo_outlier, exam_scores_wo_outlier)
# Create a table for comparison
comparison_table = pd.DataFrame({
    "Metric": ["Pearson's Correlation", "Spearman's Correlation", "Kendall's Tau"],
    "With Outlier": [corr_coef_with_outlier, spearman_corr_with_outlier, kendall_corr_with_outlier],
    "Without Outlier": [corr_coef_wo_outlier, spearman_corr_wo_outlier, kendall_corr_wo_outlier]
})
comparison_table

The table shows that introducing the outlier decreases all three correlation coefficients, with Spearman's and Kendall's Tau exhibiting a relatively smaller change than Pearson's correlation. This demonstrates their stronger resilience to outliers and underscores the value of these robust methods, particularly for datasets containing outliers that cannot be readily excluded. The especially small change in Kendall's Tau, despite its generally lower statistical power compared to the other two metrics, further emphasises the robustness of these non-parametric measures against aberrant data points.
Robust Regression Techniques

When linear regression estimates are skewed by outliers, employing robust alternatives like Huber regression and Quantile regression can offer more reliable results.
Huber Regression effectively diminishes the impact of outliers by modifying the loss function, integrating squared losses for data points that are in proximity to the model's predictions, while adopting absolute losses for those significantly distant.
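This hybrid loss can be sketched in a few lines (an illustrative implementation, not scikit-learn's internal one; the threshold of 1.35 mirrors HuberRegressor's default epsilon parameter):
import numpy as np

def huber_loss(residuals, delta=1.35):
    abs_r = np.abs(residuals)
    quadratic = 0.5 * residuals**2            # squared loss near the fit
    linear = delta * (abs_r - 0.5 * delta)    # absolute (linear) loss far from it
    return np.where(abs_r <= delta, quadratic, linear)
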
Quantile Regression, in contrast, shifts the focus towards estimating the median (or other quantiles) of the response variable, rather than the mean. This method provides a broader perspective on the possible outcomes and their associated probabilities, offering insights into the distribution's tails where outliers typically reside.
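Under the hood, quantile regression minimises the so-called pinball (or check) loss; at the median (tau = 0.5) it reduces to a scaled absolute loss, which is why the fit resists extreme response values. A minimal illustrative sketch:
import numpy as np

def pinball_loss(residuals, tau=0.5):
    # residuals = y_true - y_pred; tau is the target quantile
    return np.where(residuals >= 0, tau * residuals, (tau - 1) * residuals)
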
To assess the efficacy of these methods, we will apply both Huber and Quantile regression to model the relationship between exam scores and study hours, subsequently comparing their results with those derived from traditional Linear Regression.
# Function for Quantile Regression
def quantile_regression(X, y, quantile):
    model = sm.QuantReg(y, sm.add_constant(X))
    fitted_model = model.fit(q=quantile)
    predictions = fitted_model.predict(sm.add_constant(X))
    return predictions
# Reshape data for sklearn
X = study_hours.reshape(-1, 1)
y = exam_scores
# Huber Regression with all data points
huber_model_with_outlier = HuberRegressor()
huber_model_with_outlier.fit(X, y)
huber_predictions_with_outlier = huber_model_with_outlier.predict(X)
# Huber Regression without the outlier
X_without_outlier = X[mask]
y_without_outlier = y[mask]
huber_model_without_outlier = HuberRegressor()
huber_model_without_outlier.fit(X_without_outlier, y_without_outlier)
huber_predictions_without_outlier = huber_model_without_outlier.predict(X_without_outlier)
# Quantile Regression with all data points
quantile_predictions_with_outlier = quantile_regression(X.ravel(), y, 0.5)
# Quantile Regression without the outlier
quantile_predictions_without_outlier = quantile_regression(X_without_outlier.ravel(), y_without_outlier, 0.5)
# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(study_hours[mask], exam_scores[mask], color='skyblue', alpha=0.6, label='Data Points', s=50)
plt.scatter(study_hours[98], exam_scores[98], color='red', label='Outlier', edgecolors='black', s=50)
# Plotting ground truth
plt.plot(study_hours, predictions_without_outlier, color='black', label='Ground truth', linestyle='--')
# Plotting regression with outlier
plt.plot(study_hours, predictions_with_outlier, color='red', label='Regression with Outlier')
# Plotting Huber regression lines
plt.plot(study_hours, huber_predictions_with_outlier, color='blue', label='Huber with Outlier')
# Plotting Quantile regression lines
plt.plot(study_hours, quantile_predictions_with_outlier, color='orange', label='Quantile with Outlier')
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.title('Huber and Quantile Regression vs Linear Regression')
plt.legend()
plt.show()

In the plot above, the linear regression model fitted to the data without the outlier (dashed black line) can be considered a proxy for the 'ground truth' and serves as a benchmark against which the other models can be compared. Notably, the regression lines for both Huber and Quantile regression appear unaffected by the outlier, aligning closely with the 'ground truth' line. This indicates their robustness and ability to resist the skewing effect of outliers. For comparison, the linear regression model that includes the outlier is also shown, depicted in red. The contrast between the slopes highlights the significant deviation caused by the outlier in the standard linear regression model and underscores the effectiveness of Huber and Quantile regression in dealing with anomalies.
Advanced Robust Regression Techniques

While Huber regression is effective for mild outliers, its performance can degrade when a dataset contains a significant number of outliers, or particularly extreme ones. This limitation stems from its loss function, which is only piecewise robust: it offers robustness up to a certain threshold, beyond which its effectiveness diminishes. Quantile regression, on the other hand, stands out for its resilience to outliers in the response variable, yet it falls short in addressing outliers among the predictors, or independent variables.
To tackle more challenging outlier scenarios effectively, advanced robust techniques like RANSAC (Random Sample Consensus) and Theil-Sen estimators come into play. These methods are designed to offer an enhanced capacity for dealing with a wide range of outlier conditions, making them suitable for datasets where traditional robust methods falter.
To explore the capabilities of these advanced techniques, we will begin by simulating data that includes extreme outliers in both the dependent and independent variables. This simulated dataset will provide a testing ground to demonstrate how RANSAC and Theil-Sen can effectively navigate complex outlier scenarios.
# Simulating study hours and exam scores
study_hours = np.linspace(0, 20, 100)
exam_scores = 3 * study_hours + np.random.normal(10, 5, size=study_hours.shape)
# Add outliers in both dependent and independent variables
study_hours_outliers = np.linspace(45, 50, 20)
exam_scores_outliers = 2 * (study_hours_outliers - 45) + np.random.normal(0, 5, size=study_hours_outliers.shape)
study_hours_with_outlier = np.concatenate([study_hours, study_hours_outliers])
score_with_outlier = np.concatenate([exam_scores, exam_scores_outliers])
# Plot the dataset
plt.figure(figsize=(10, 6))
plt.scatter(study_hours, exam_scores, color='skyblue', alpha=0.6, label='Data Points',edgecolor='black', s=50)
plt.scatter(study_hours_outliers, exam_scores_outliers, color='red', label='Outliers', edgecolor='black', s=50)
plt.xlabel("Study Hours")
plt.ylabel("Exam Scores")
plt.title("Simulated Dataset with Extreme Outliers in the Dependent and Indepented Variable")
plt.legend()
plt.show()

The RANSAC regression is a pivotal technique widely used across Data Science and computer vision for its efficacy at distinguishing between inliers (data points that align with a hypothesised model) and outliers. RANSAC operates on a simple yet effective principle: it iteratively selects a random subset of the original dataset, under the assumption that this subset comprises solely inliers, to estimate the model parameters. Each iteration involves fitting a model to this random subset and then evaluating this model against the entire dataset to ascertain the count of data points that conform to the model within a specified tolerance level – these are identified as inliers. The model that garners the broadest consensus, evidenced by the highest tally of inliers, is then adopted as the optimal model.
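To make the procedure concrete, here is a bare-bones RANSAC loop for a one-dimensional linear fit. All names are illustrative, and scikit-learn's RANSACRegressor adds refinements such as refitting on the full consensus set and data-driven stopping criteria:
import numpy as np

def simple_ransac(x, y, n_iters=100, threshold=5.0, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers, best_params = None, None
    for _ in range(n_iters):
        i, j = rng.choice(len(x), size=2, replace=False)  # minimal sample: 2 points
        if x[i] == x[j]:
            continue  # degenerate sample, cannot define a line
        slope = (y[j] - y[i]) / (x[j] - x[i])             # candidate model
        intercept = y[i] - slope * x[i]
        residuals = np.abs(y - (slope * x + intercept))
        inliers = residuals < threshold                   # the consensus set
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_params = inliers, (slope, intercept)
    return best_params, best_inliers
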
Conversely, the Theil-Sen estimator represents a robust non-parametric method for deducing the slope among a collection of two-dimensional data points. The crux of the Theil-Sen estimator's methodology involves calculating the slopes between every possible pair of points within the dataset, subsequently deriving the median of these slopes to serve as the slope estimate. The reliance on the median as a measure of central tendency provides the estimator with a formidable resistance to outliers, given the median's inherent robustness against extreme values.
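A naive version fits in a few lines, as the sketch below shows (illustrative code; scikit-learn's TheilSenRegressor subsamples point pairs to stay efficient on larger datasets):
from itertools import combinations
import numpy as np

def theil_sen_fit(x, y):
    # median of all pairwise slopes, skipping pairs with identical x
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[i] != x[j]]
    slope = np.median(slopes)
    intercept = np.median(y - slope * x)  # one common choice of intercept
    return slope, intercept
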
Moving from theory to application, we shall now delve into the practical handling of data contaminated with numerous and pronounced outliers. Our exploration commences with a comparative analysis between the RANSAC regression and Huber regression, aiming to illustrate the distinct approaches and their respective efficacies in dealing with severe outlier anomalies.
# RANSAC Regression
ransac = RANSACRegressor()
ransac.fit(study_hours_with_outlier.reshape(-1, 1), score_with_outlier)
# Huber Regression
huber_model_with_outlier = HuberRegressor()
huber_model_with_outlier.fit(study_hours_with_outlier.reshape(-1, 1), score_with_outlier)
huber_predictions_with_outlier = huber_model_with_outlier.predict(study_hours_with_outlier.reshape(-1, 1))
# Predict and plot
plt.figure(figsize=(10, 6))
plt.scatter(study_hours, exam_scores, color='skyblue', alpha=0.6, label='Data Points',edgecolor='black', s=50)
plt.scatter(study_hours_outliers, exam_scores_outliers, color='red', label='Outliers', edgecolor='black', s=50)
plt.plot(study_hours_with_outlier, ransac.predict(study_hours_with_outlier.reshape(-1, 1)), 'g', label="RANSAC Fit")
plt.plot(study_hours_with_outlier, huber_predictions_with_outlier, color='blue', label='Huber Fit')
plt.xlabel("Study Hours")
plt.ylabel("Exam Scores")
plt.title("RANSAC Regression vs Huber Regression")
plt.legend()
plt.show()

The above plot showcases the capability of RANSAC regression (green line) in accurately fitting the data despite the presence of numerous and extreme outliers. This contrasts with traditional robust methods like Huber regression, depicted by the blue line, which struggles to accurately delineate the underlying data trend amidst such pronounced anomalies.
Next, we turn our attention to the Theil-Sen estimator to evaluate its performance in outlier-rich scenarios and to compare its effectiveness against Quantile regression. We will explore how this non-parametric method stands up to the challenges posed by outliers and how its outcomes juxtapose with those derived from Quantile regression.
# Theil-Sen Estimator
theil_sen = TheilSenRegressor(random_state=42)
theil_sen.fit(study_hours_with_outlier.reshape(-1, 1), score_with_outlier)
# Quantile Regression
quantile_predictions_with_outlier = quantile_regression(study_hours_with_outlier.ravel(), score_with_outlier, 0.5)
# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(study_hours, exam_scores, color='skyblue', alpha=0.6, label='Data Points',edgecolor='black', s=50)
plt.scatter(study_hours_outliers, exam_scores_outliers, color='red', label='Outliers', edgecolor='black', s=50)
plt.plot(study_hours_with_outlier, theil_sen.predict(study_hours_with_outlier.reshape(-1, 1)), 'b', label="Theil-Sen Fit")
# Plotting Quantile regression lines
plt.plot(study_hours_with_outlier.ravel(), quantile_predictions_with_outlier, color='orange', label='Quantile Regression Fit')
plt.xlabel("Study Hours")
plt.ylabel("Exam Scores")
plt.title("Theil-Sen Estimator vs Quantile Regression")
plt.legend()
plt.show()

The performance of the Theil-Sen estimator aligns with expectations, showing only minimal impact from the aberrant data points, a resilience similar to the previously discussed RANSAC regression. In contrast, Quantile regression appears less capable of accurately modeling the fundamental relationship between study hours and exam scores when faced with significant outliers.
In conclusion, both RANSAC and Theil-Sen estimation techniques emerge as robust and effective methodologies for model fitting in datasets that are heavily tainted with outliers. Their ability to discern and mitigate the influence of such anomalies makes them invaluable assets for data scientists and analysts, ensuring that the integrity of the underlying data trends can be preserved and accurately represented, even in challenging analytical scenarios.
Challenges and Considerations in Choosing the Right Tool

Selecting the appropriate statistical method to analyse variable relationships, especially amidst outliers, necessitates a careful evaluation of each technique's strengths and limitations. This guidance aims to delineate key considerations for choosing among robust methods, shedding light on their ideal applications and potential constraints.
Spearman's Rank Correlation
When to use:
- Best suited for non-parametric settings where the data deviates from a normal distribution or the relationship between variables is monotonic but not strictly linear.
- Useful when data has ordinal characteristics or is ranked.
Limits:
- Although more robust to outliers than Pearson's correlation, Spearman's effectiveness can wane in the face of numerous tied ranks, potentially skewing the correlation coefficient.
Kendall's Tau
When to use:
- Optimal for smaller datasets or when detailed examination of rank agreements is necessary.
- It excels with ordinal data and in evaluating variable associations without presuming their distribution.
Limits:
- The pairwise comparison nature makes it computationally hefty for larger datasets, and its interpretability might not be as straightforward as Pearson's or Spearman's correlations.
Huber Regression
When to use:
- Ideal for linear models with predominantly well-behaved data but containing some outliers, especially when these outliers are not excessively extreme and the underlying relationship is linear.
Limits:
- Its efficacy wanes as the proportion or extremity of outliers increases.
- Precisely setting the tuning parameter, delta, is critical but can be challenging.
Quantile Regression
When to use:
- Useful in understanding the conditional distribution of the dependent variable across different quantiles, rather than merely its average.
- Useful for capturing the impact of outliers on different quantiles and offers a more complete view of the potential outcomes.
Limits:
- Interpretation and communication can be more complex compared to linear regression.
- Selecting the appropriate quantiles for analysis often requires domain-specific knowledge.
RANSAC
When to use:
- Highly suitable for datasets heavily laden with outliers, aiming to model the subset of data adhering to a certain structure.
- Its adaptability extends to both linear and non-linear models.
Limits:
- Its effectiveness hinges on parameter selection, such as iteration count and inlier identification threshold.
- The stochastic nature may lead to variability in outcomes, necessitating multiple iterations for consistent results.
Theil-Sen Estimators
When to use:
- Particularly effective for linear models in datasets with a significant presence of outliers.
- As a non-parametric method, it doesn't assume a specific data distribution, broadening its applicability.
Limits:
- The method can be computationally intensive for large datasets due to its all-pairs slope calculations.
- While robust, it may not perform on par with parametric methods in outlier-free scenarios.
Conclusions
The selection of a statistical technique must be guided by an intricate understanding of each method's unique capabilities and constraints, tailored to the characteristics of the dataset at hand and the particular research inquiry. It's crucial to consider the dataset's attributes, the precise questions being posed, and the foundational assumptions about the data when deciding on the most suitable analytical approach. Often, employing a multifaceted strategy that involves testing various methods and comparing their outcomes can reveal deep insights, aiding in the identification of the most fitting technique for the specific analytical context. This iterative process of exploration and comparison not only enhances the robustness of the analysis but also enriches the researcher's toolkit, equipping them to tackle a wide array of data challenges with greater precision and confidence.
Forward look: Advanced Robust Statistical Methods
In the upcoming tutorial we are set to explore the cutting-edge realm of robust statistical techniques, tailored to address intricate data-related challenges. This final part of our series will focus on high-breakdown point strategies, robust approaches to multivariate analysis, and resilient techniques in time series analysis, providing an in-depth look at their usage and execution. By incorporating Python-based case studies, this segment aims to illustrate the tangible application of these sophisticated methodologies, empowering you with the necessary skills to bolster the robustness and precision of your statistical analyses amidst the complexities posed by data outliers and anomalies.