Streamlining Repetitive Tasks During Exploratory Data Analysis
Automation in Data Science

Programming Principle: Automate the Mundane
It's often said that lazy programmers make the best programmers. More accurately, programmers who lack the patience for repetitive workflows invest time upfront to automate whatever they can. The best programmers don't patiently repeat mundane tasks – they build tools that save them effort down the road, whether that means learning keyboard shortcuts, writing custom modules, or finding software that automates their workflows.
In a post titled "Why Good Programmers Are Lazy and Dumb," Philipp Lenssen states:
"Only a lazy programmer will avoid writing monotonous, repetitive code – thus avoiding redundancy, the enemy of software maintenance and flexible refactoring […] for a lazy programmer to be a good programmer, he (or she) also must be incredibly unlazy when it comes to learning how to stay lazy – that is, which software tools make his work easier, which approaches avoid redundancy, and how he can make his work be maintained and refactored easily."
Nobody enjoys tedious, monotonous work, and anyone who finds themselves rewriting the same functions across projects should feel that frustration creep in and whisper, "package them into a module."

The Repetitive Nature of EDA
One area where these whispers have haunted me is the exploratory data analysis phase.
Exploratory data analysis (EDA) involves using statistical techniques and visualizations to study your data, understand its structure, identify patterns, and detect irregularities or outliers. Because identical analyses and visuals are often needed for each new dataset, EDA can benefit greatly from automation.
Limits of Full Automation
However, I'd been deterred in each of my prior attempts because complete automation is hindered by the unique challenges of each dataset, such as determining encoding strategies and ensuring correct data types. The interplay between data cleaning and analysis is iterative, which makes it difficult to standardize entirely.
A Modular Approach
To address this limitation, I created a utility that assumes the data has already been minimally processed with the correct data types. It also necessitates defining the numerical columns, categorical columns, and the target column (it assumes we're working with a classification task).
What does it contain?
- High-level statistics for both numerical and categorical data
- Statistical tests of significance
- A correlation heatmap
- Category averages
- Data distribution visualizations
The function also offers flexibility with optional parameters to enable or disable any of the functionalities above.
This article aims to demonstrate the value of creating customized EDA utilities. While the example focuses on automated summaries and visualizations, the key is to identify the pain points in your own repetitive EDA work and codify those workflows. Rather than including the full code, I will focus on demonstrating the key capabilities and sample outputs of the utility.
The Dataset
The dataset was uploaded to Kaggle to examine the factors that may be predictive of whether or not a patient will be diagnosed with a stroke.

Light Pre-processing and Feature Engineering
I began the process by:
- Extracting HDL and LDL cholesterol values from 'Cholesterol Levels'
- Generating binary indicator columns for each symptom
- Converting both categorical columns and the target column into numerical codes through label encoding
import pandas as pd

# Define a function to extract values from a column and convert to integer
def extract_and_convert(column, prefix):
    return column.str.extract(rf'{prefix}(\d+)')[0].astype(int)

# Extract HDL and LDL values and add them as new columns
df['HDL'] = extract_and_convert(df['Cholesterol Levels'], 'HDL:')
df['LDL'] = extract_and_convert(df['Cholesterol Levels'], 'LDL:')

# List of unique symptoms
unique_symptoms = ['Difficulty Speaking', 'Headache', 'Loss of Balance', 'Dizziness',
                   'Confusion', 'Seizures', 'Blurred Vision', 'Severe Fatigue',
                   'Numbness', 'Weakness']

# Create a binary column for each unique symptom indicating its presence in 'Symptoms'
for symptom in unique_symptoms:
    df[symptom] = df['Symptoms'].str.contains(symptom)

# Convert categorical columns to numerical codes using label encoding
df[categorical_columns] = df[categorical_columns].apply(lambda x: pd.factorize(x)[0])

# Convert the target variable to numerical codes using label encoding
df[target] = pd.factorize(df[target])[0]
Sample of Lightly Pre-processed Data

From here, there are two steps I need to take:
- Define the numerical, categorical, and target columns
- Run summary() and specify which functions I'd like to see
Define Numerical, Categorical, and Target Columns
# Define numerical columns
numerical_columns = ['age', 'bmi', 'glucose', 'stress', 'bp', 'hdl', 'ldl']

# Define categorical columns
categorical_columns = ['gender', 'hypertension', 'heart_dis', 'married', 'work', 'residence',
                       'smoker', 'alcohol', 'fitness', 'stroke_history', 'family_stroke_history',
                       'diet', 'speech', 'headache', 'balance', 'dizziness', 'confusion',
                       'seizures', 'vision', 'fatigue', 'numbness', 'weakness']

# Define target column
target = 'diagnosis'
In this article, I've included the larger summary() utility and excluded the helper functions: calculate_entropy(), statistical_tests(), plot_distribution_plots(), plot_correlation_heatmap(), calculate_categorical_summary(), and calculate_numerical_summary().
Summary() Implementation
from typing import Optional
import pandas as pd
from IPython.display import display

def summary(df: pd.DataFrame,
            numerical_columns: list,
            categorical_columns: list,
            target: str,
            categorical_summary: Optional[bool] = True,
            numerical_summary: Optional[bool] = True,
            perform_tests: Optional[bool] = True,
            plot_corr_heatmap: Optional[bool] = True,
            calculate_cat_averages: Optional[bool] = True,
            plot_distribution: Optional[bool] = True) -> None:
    """
    Generate a summary of data exploration tasks.
    """
    df_numerical = df[numerical_columns]
    df_categorical = df[categorical_columns]

    # Join numerical and categorical columns together
    df_joined = df_numerical.join(df_categorical)
    df_joined[target] = df[target]

    if categorical_summary:
        print('\nCATEGORICAL SUMMARY')
        categorical_summary = calculate_categorical_summary(df_categorical)
        display(categorical_summary.round(2))

    if numerical_summary:
        print('\nNUMERICAL SUMMARY')
        numerical_summary = calculate_numerical_summary(df_numerical)
        display(numerical_summary.round(2))

    if perform_tests:
        print('\nSTATISTICAL TESTS')
        df_summary = statistical_tests(df, categorical_columns, numerical_columns, target)
        display(df_summary.round(2))

    if plot_corr_heatmap:
        plot_correlation_heatmap(df_joined)

    if calculate_cat_averages:
        for col in categorical_columns:
            display(df_joined.groupby(col).mean())

    if plot_distribution:
        plot_distribution_plots(df, categorical_columns + [target], numerical_columns)
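With the columns defined, running the analysis is a single call, and each piece can be switched off through its flag – for example, to skip the per-category averages and distribution plots:
# Run the exploration, toggling off the pieces I don't need this time
summary(df,
        numerical_columns=numerical_columns,
        categorical_columns=categorical_columns,
        target=target,
        calculate_cat_averages=False,
        plot_distribution=False)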
Categorical and Numerical Summaries
The utility generates two statistical summaries – one for categorical variables and one for numerical variables.
The categorical summary provides high-level insight into each category, including:
- Number of unique values
- The most frequent value and its frequency
- Percentage of missing values
- Entropy – a measure of randomness in the distribution
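For readers who want to reproduce this table, a simplified sketch of the categorical helper (and the entropy calculation it relies on) might look like the following – the real implementation is omitted from this article, so treat this as an illustration rather than the exact code:
import numpy as np
import pandas as pd

def calculate_entropy(series: pd.Series) -> float:
    """Shannon entropy of a column's value distribution."""
    probabilities = series.value_counts(normalize=True)
    return float(-(probabilities * np.log2(probabilities)).sum())

def calculate_categorical_summary(df_categorical: pd.DataFrame) -> pd.DataFrame:
    """Unique values, most frequent value and its frequency, missing %, and entropy per column."""
    rows = {}
    for col in df_categorical.columns:
        counts = df_categorical[col].value_counts()
        rows[col] = {
            'unique_values': df_categorical[col].nunique(),
            'most_frequent': counts.index[0],
            'frequency': counts.iloc[0],
            'missing_pct': df_categorical[col].isna().mean() * 100,
            'entropy': calculate_entropy(df_categorical[col]),
        }
    return pd.DataFrame(rows).T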
The numerical summary calculates common descriptive stats like:
- Number of unique values
- Percentage of missing values
- Number of outliers
- Central tendency measures (mean, median)
- Dispersion measures (standard deviation, min/max)
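The numerical counterpart can be sketched the same way; here I count outliers with the common 1.5 × IQR rule, which is an assumption on my part rather than the only reasonable definition:
import pandas as pd

def calculate_numerical_summary(df_numerical: pd.DataFrame) -> pd.DataFrame:
    """Unique values, missing %, outlier count (1.5 * IQR rule), and descriptive stats per column."""
    rows = {}
    for col in df_numerical.columns:
        series = df_numerical[col].dropna()
        q1, q3 = series.quantile([0.25, 0.75])
        iqr = q3 - q1
        outliers = ((series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)).sum()
        rows[col] = {
            'unique_values': series.nunique(),
            'missing_pct': df_numerical[col].isna().mean() * 100,
            'outliers': int(outliers),
            'mean': series.mean(),
            'median': series.median(),
            'std': series.std(),
            'min': series.min(),
            'max': series.max(),
        }
    return pd.DataFrame(rows).T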
This breakdown serves as a rapid evaluation of the distribution and integrity of both categorical and numerical data. These summaries effectively pinpoint areas warranting deeper exploration, such as notable instances of missing data or significant outliers. Collectively, they offer a comprehensive snapshot of the dataset's fundamental characteristics.
For instance, in the summary below, it's evident that there are four outliers in the blood pressure data, that half of the patients have a history of stroke, and that 75% exhibit high blood pressure.

Statistical Tests
The statistical test summary evaluates the relationship between each feature and the target variable: the utility runs a chi-squared test for categorical variables and a two-tailed t-test for numerical variables.
However, these tests have limitations. They detect simple univariate associations but can miss non-linear relationships or complex interactions between variables. The results provide a starting point for identifying potentially predictive features, but further analysis is needed to uncover nuanced relationships. Therefore, the automatic tests accelerate initial feature screening but should be combined with deeper techniques like multivariate modeling and ensemble methods to derive further insights.
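As a rough sketch of what these tests look like in code – assuming scipy is available and simplifying the output to one p-value per feature – the helper could be written as:
import pandas as pd
from scipy import stats

def statistical_tests(df, categorical_columns, numerical_columns, target):
    """Chi-squared tests for categorical features, two-tailed t-tests for numerical ones."""
    results = {}
    for col in categorical_columns:
        # Chi-squared test of independence between the feature and the target
        contingency = pd.crosstab(df[col], df[target])
        _, p_value, _, _ = stats.chi2_contingency(contingency)
        results[col] = {'test': 'chi-squared', 'p_value': p_value}
    for col in numerical_columns:
        # Two-tailed t-test comparing the feature across the two target classes
        group_0 = df.loc[df[target] == 0, col].dropna()
        group_1 = df.loc[df[target] == 1, col].dropna()
        _, p_value = stats.ttest_ind(group_0, group_1)
        results[col] = {'test': 't-test', 'p_value': p_value}
    return pd.DataFrame(results).T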

Correlation Heatmap
This visualization highlights the Spearman correlation between the numerical variables, ordinal variables, and the target variable. Spearman's correlation was chosen because it is more robust across different types of relationships: unlike Pearson's, it is non-parametric and rank-based, making it suitable for ordinal variables and monotonic non-linear relationships.
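A minimal version of the heatmap helper, assuming seaborn and matplotlib are available, could be as simple as:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_correlation_heatmap(df_joined: pd.DataFrame) -> None:
    """Spearman correlation heatmap of the joined numerical and ordinal columns."""
    corr = df_joined.corr(method='spearman')
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
    plt.title('Spearman Correlation Heatmap')
    plt.tight_layout()
    plt.show()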

Plots
For distribution visualizations, summary() returns bar plots for categorical variables and histograms and box plots for numerical variables. These plots can reveal where data should be separated and treated differently, and they may highlight quality assurance (QA) issues or anomalies.
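In sketch form, the plotting helper simply loops over the two column groups (again assuming seaborn and matplotlib; the figure sizes and styling are placeholders):
import matplotlib.pyplot as plt
import seaborn as sns

def plot_distribution_plots(df, categorical_columns, numerical_columns):
    """Bar plots for categorical columns; a histogram and box plot for each numerical column."""
    for col in categorical_columns:
        plt.figure(figsize=(6, 3))
        sns.countplot(x=col, data=df)
        plt.title(f'Distribution of {col}')
        plt.show()
    for col in numerical_columns:
        _, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 3))
        sns.histplot(df[col].dropna(), ax=ax_hist)
        sns.boxplot(x=df[col].dropna(), ax=ax_box)
        ax_hist.set_title(f'{col} histogram')
        ax_box.set_title(f'{col} box plot')
        plt.show()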


Concluding Remarks
This article demonstrated a sample EDA utility focused on the automated generation of statistical summaries, visualizations, and basic feature analysis. While not comprehensive, it allows rapid exploration of new datasets and surfaces insights to guide more targeted analysis. With some customization, these utilities can be adapted to suit the typical exploratory workflow for different domains or business contexts.
The key is identifying redundancies in your process and taking the time upfront to codify your workflow. This compounds over time, allowing you to focus cognitive resources on higher-value areas like domain knowledge, feature engineering, and modeling. In short – create your utilities, let automation handle the grunt work, and keep your focus on the art.