5 Python One-Liners to Kick Off Your Data Exploration

When it comes to machine learning, Exploratory Data Analysis (EDA) is one of the first things you need to do once you've collected and loaded your data into Python.
EDA involves:
- Summarizing data via descriptive statistics
- Visualizing data
- Identifying patterns, detecting anomalies, and generating hypotheses
Through EDA, data scientists gain a deeper understanding of their data, enabling them to assess data quality and prepare for more complex machine learning tasks.
But sometimes it can be a challenge when you're first starting out and don't know where to begin.
Here are five simple Python one-liners that can kickstart your EDA process.
1. df.info()
This is a must for every EDA process. In fact, it's always the first line of code I run after I've loaded my DataFrame.
It tells you:
- The names of columns
- How many non-null values are in each column
- The data types of the columns
It's a good sanity check on your dataset.
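Here's a minimal sketch, assuming the Census data used in the examples below has been saved locally as adult.csv (the filename is a placeholder):

import pandas as pd

# Load the dataset (replace the path with your own file)
df = pd.read_csv('adult.csv')

# Column names, non-null counts, and dtypes in one call
df.info()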

2. df.describe()
This one-liner works great for datasets that are primarily numerical.
Even if you aren't working with all numerical columns, df.describe() will, by default, only show you results for numerical columns. Additionally, you can filter and call df.describe() on individual columns.
It provides you with:
- The count of values
- The mean of your data
- The standard deviation of your data
- The minimum and maximum values of your data
- Values that mark the 25th, 50th, and 75th percentiles of your data
Continuing to use the Census dataset as an example, if we call df.describe() on the entire dataset, we will only get values for the following columns: 'age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week'.
But even though these all have numerical values, some of them are categorical, such as 'education-num' (a numerical code that corresponds to someone's education level, e.g., Bachelor's or Master's). I also don't want to look at the 'fnlwgt' column at the moment, even though it's technically numerical.
So, if you only want to see the relevant columns:
df[['age', 'capital-gain', 'capital-loss',
    'hours-per-week']].describe()
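You can also run it on a single column, which returns the same statistics as a Series:

# Describe a single column
df['age'].describe()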

3. df.value_counts()
This is a great one for datasets with a lot of categorical data or for datasets with a binary or categorical target variable.
# Get value counts for our target variable
df['income'].value_counts()

This result is actually really helpful, because it showcases an error in the data.
The function is interpreting "<=50k" and " <=50k." as two separate categories, when really they're meant to be the same thing. It does the same with ">50k" and ">50k."
We can now clean our target column so that we only have two categories, ">50k" and "<=50k".
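A minimal cleaning sketch, assuming the duplicate labels differ only by surrounding whitespace and a trailing period:

# Normalize the labels: strip whitespace, then drop any trailing period
df['income'] = df['income'].str.strip().str.rstrip('.')

# Re-check: only two categories should remain
df['income'].value_counts()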
This method is also helpful for checking how balanced a feature or the target class is. If you see that there are 99 cases of ">50k" but only one case of "<=50k" (an extreme example), that's a clear sign that your dataset is imbalanced.
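To check balance as proportions rather than raw counts, pass normalize=True:

# Share of each class instead of raw counts
df['income'].value_counts(normalize=True)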
4. df.corr()
This one will depend on the kind of data you're dealing with and what your goal is.
If you have a time series dataset and your goal is simply to relate seasonality patterns to some target variable, examining the correlation may not be as insightful, because many of the features engineered from time series data for ML consumption (day of week, month, holiday flags) are categorical rather than numerical, and Pearson correlation assumes numerical data.
df[['age', 'capital-gain', 'capital-loss',
    'hours-per-week']].corr()

As you can see, df.corr() produces a correlation matrix as a DataFrame, showing the Pearson correlation between each pair of columns.
This can help you to pick out features which could be a good starting place for your model, as well as exclude features that you don't think would be helpful.
It can also be used to identify multicollinearity among features, which can cause problems in certain types of models, especially linear regression.
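A quick way to eyeball that matrix is as a heatmap. Here's a minimal sketch using plotly.express (px.imshow's text_auto argument needs a reasonably recent plotly version):

import plotly.express as px

# Render the correlation matrix as an annotated heatmap
corr = df[['age', 'capital-gain', 'capital-loss', 'hours-per-week']].corr()
px.imshow(corr, text_auto=True).show()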
5. px.scatter & px.line w/ trendline
If you have data that can be plotted on a line or scatter chart, and you want to quickly run a linear regression between one variable and one target, you can use plotly.express scatter or line charts.
These are very simple to use: all you need to do is pass in a DataFrame and the names of the columns you are plotting on the x and y axes.
I'll show you two use cases: one where a trendline is useful and one where it isn't.
Use case 1: Time series data (No trendline)
Time series data typically comes with a timestamp column (e.g., Datetime) and the value you want to plot (listed below as 'y'). I like to see line charts and scatter plots for time series data.
import plotly.express as px

# Line plot
px.line(df, x='Datetime', y='y')

# Scatter plot
px.scatter(df, x='Datetime', y='y')


Plotting out your time series data can give you a good idea of seasonal patterns, as well as help you identify outliers, zero values, and chunks of missing data.
Use case 2: Two correlated variables (w/ trendline)
Using the trendline keyword in px.scatter draws a linear regression trendline through your x and y variables using the Ordinary Least Squares (OLS) algorithm. This is basically just your standard linear regression. (Note: the 'ols' trendline requires the statsmodels package to be installed.)
To illustrate this example, I've loaded in the Wine Quality dataset from the UCI Machine Learning Repository (CC BY 4.0 license). I plotted two features against each other to see the relationship between them:
# Scatter plot with trendline
px.scatter(df, x='fixed_acidity', y='density', trendline='ols')

The OLS trendline provides you with the y = mx + b linear equation as well as the R² value; hover over the trendline in the rendered chart to see both.
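If you'd rather pull those fit statistics out programmatically, plotly.express exposes the underlying statsmodels results; a minimal sketch:

import plotly.express as px

fig = px.scatter(df, x='fixed_acidity', y='density', trendline='ols')

# Retrieve the fitted statsmodels results behind the trendline
results = px.get_trendline_results(fig)
print(results.px_fit_results.iloc[0].summary())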
Conclusion
These five one-liners are a great place to start when you've just loaded a new dataset and want to begin exploring. They're a great jumping-off point that can lead you to dig deeper into your dataset, as well as point out weaknesses in it. Once you've explored your data, you can then start to clean and prepare it for modeling.
Thanks for reading
Connect with me on LinkedIn