Practical Tips for Improving Exploratory Data Analysis

Author:Murphy | View: 30055 | Time: 2025-03-23 13:04:24

Introduction

Exploratory data analysis (EDA) is a mandatory step before using any machine learning models. The EDA process requires focus and patience from data analysts and data scientists: before getting meaningful insights from the analysed data, it is often necessary to spend a lot of time actively using one or more visualization libraries.

In this post, I will share with you some tips on how to ease the EDA procedure and make it faster, based on my personal experience. In particular, I will give you three important pieces of advice that I learnt while fighting against the EDA:

Use non-trivial charts that are most suitable for your task;
Use the functionality of the visualization library to its fullest;
Look for a faster way of making the same stuff.

Note: to create infographics in this post, we will use the Wind Power Generated Data from Kaggle [2]. Let's get started!

Tip 1: Don't be afraid of using non-trivial charts

I learnt how to apply this tip when I worked on a research paper related to wind energy analysis and prediction [1]. While doing the EDA for this project, I needed to create a summary matrix that would reflect all the relationships between the wind parameters, in order to find which of them have the strongest influence on each other. The first idea that came to my mind was to build a 'good old' correlation matrix of the kind I had seen in many Data Science / Data Analysis projects.

As you know, a correlation matrix is used to quantify and summarize linear relationships between variables. In the following code snippet, the corrcoef function was used on the feature columns of Wind Power Generated Data. Here I also applied the heatmap function from Seaborn to plot the correlation matrix array as a heat map:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)
cols = ['P', 'Ws', 'Power_curve', 'Wa']

# build the matrix
correlation_matrix = np.corrcoef(data[cols].values.T)
hm = sns.heatmap(correlation_matrix,
                 cbar=True, annot=True, square=True, fmt='.3f',
                 annot_kws={'size': 15},
                 cmap='Blues',
                 yticklabels=['P', 'Ws', 'Power_curve', 'Wa'],
                 xticklabels=['P', 'Ws', 'Power_curve', 'Wa'])

# save the figure
plt.savefig('image.png', dpi=600, bbox_inches='tight')
plt.show()

Example of the built correlation matrix. Image by Author.

Looking at the resulting heat map, we can conclude that wind speed and active power are strongly correlated. Still, I think many people will agree that this kind of visualization is not easy to interpret, because all it gives us is numbers.
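One way to make the numbers themselves easier to scan is to list the feature pairs ordered by the absolute value of their correlation. The snippet below is only a sketch: rank_correlations is a hypothetical helper name, and the data frame is a synthetic stand-in that reuses the short column names from above, not the Kaggle data.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def rank_correlations(df):
    """Return feature pairs ordered by absolute Pearson correlation."""
    corr = df.corr(method="pearson")
    # take each unordered pair once, so the mirrored half of the matrix is skipped
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(df.columns, 2)}
    return pd.Series(pairs).sort_values(key=lambda s: s.abs(), ascending=False)


# synthetic stand-in data (assumption): power grows with wind speed, direction is random
rng = np.random.default_rng(0)
ws = rng.uniform(0, 25, 500)
demo = pd.DataFrame({
    "P": 100 * ws + rng.normal(0, 50, 500),
    "Ws": ws,
    "Wa": rng.uniform(0, 360, 500),
})

ranked = rank_correlations(demo)
print(ranked)
```

With the real data, the same call would be rank_correlations(data[cols]).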

A good alternative to the correlation matrix would be the scatterplot matrix, which allows you to visualize pairwise correlations between different features of a data set in one place. In this case, sns.pairplot should be used:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)
cols = ['P', 'Ws', 'Power_curve', 'Wa']

# build the matrix
sns.pairplot(data[cols], height=2.5)
plt.tight_layout()

# save the figure
plt.savefig('image2.png', dpi=600, bbox_inches='tight')
plt.show()

Example of the scatterplot matrix. Image by Author.

By looking at the scatterplot matrix, one can quickly eyeball how the data is distributed and whether it contains outliers or not. However, the main drawback of this kind of chart is the presence of duplicates: because the data is plotted pairwise, every pair of features appears twice, once on each side of the diagonal.

In the end, I decided to combine the above graphs into one, where the lower left part will contain scatter plots of the selected parameters, and the upper right part will contain bubbles of different sizes and colours: larger circles mean that the studied parameters have a stronger linear correlation. The diagonal of the matrix will display the distribution of each feature: a narrow peak here would indicate that this particular parameter does not change too much, while other features change.

The code for building this summary matrix is given below. Here the map consists of three parts – fig.map_lower, fig.map_diag, fig.map_upper:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)
cols = ['P', 'Ws', 'Power_curve', 'Wa']

# build the matrix
def correlation_dots(*args, **kwargs):
    corr_r = args[0].corr(args[1], 'pearson')
    ax = plt.gca()
    ax.set_axis_off()
    marker_size = abs(corr_r) * 3000
    ax.scatter([.5], [.5], marker_size,
               [corr_r], alpha=0.5,
               cmap = 'Blues',
               vmin = -1, vmax = 1,
               transform = ax.transAxes)
    font_size = abs(corr_r) * 40 + 5

sns.set(style = 'white', font_scale = 1.6)
fig = sns.PairGrid(data, aspect = 1.4, diag_sharey = False)
fig.map_lower(sns.regplot)
fig.map_diag(sns.histplot)
fig.map_upper(correlation_dots)

# save the figure
plt.savefig('image3.jpg', dpi = 600, bbox_inches = 'tight')
plt.show()

Example of the summary matrix. Image by Author.

The summary matrix combines the advantages of the two previously studied diagrams – its lower (left) part imitates the scatterplot matrix, and its upper (right) fragment graphically reflects the numerical results of the correlation matrix.

Tip 2: Use the functionality of the visualization library to its fullest

From time to time I have to present the results of EDA to colleagues and clients, so visualization is a key assistant for me in this task. I always try to add various elements to the diagrams, such as arrows and notes, to make them even more attractive and readable.

Let's go back to the EDA case for the wind project discussed above. When it comes to wind energy, one of the most important parameters is the power curve. The power curve of a wind turbine (or an entire wind farm) is a graph showing the amount of electricity generated at various wind speeds. It is important to note that turbines do not operate at low wind speeds. Their start-up is associated with a cut-in speed, which is usually in the range of 2.5–5 m/s. At speeds between 12 and 15 m/s, the nominal power is reached. Finally, each turbine has an upper limit on the wind speed at which it can safely operate. Once this limit, the cut-out speed, is reached, the wind turbine will not produce electricity until the wind speed drops back into the operating range.
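To make the three regimes concrete, here is a minimal sketch of an idealized power curve. Everything numeric in it is an assumption chosen for illustration: the default cut-in, rated and cut-out speeds, the rated power, and the cubic ramp between cut-in and rated speed. It is not the manufacturer's curve from the dataset.

```python
import numpy as np


def ideal_power_curve(ws, cut_in=3.0, rated_speed=13.0, cut_out=25.0,
                      rated_power=3600.0):
    """Idealized turbine output (kW) for wind speeds ws (m/s); defaults are hypothetical."""
    ws = np.asarray(ws, dtype=float)
    # assumed cubic ramp from zero at cut-in to rated power at rated speed
    ramp = rated_power * (ws**3 - cut_in**3) / (rated_speed**3 - cut_in**3)
    power = np.where(ws < rated_speed, ramp, rated_power)
    # no output below the cut-in speed or above the cut-out speed
    return np.where((ws < cut_in) | (ws > cut_out), 0.0, power)


speeds = [2.0, 8.0, 13.0, 20.0, 26.0]
powers = ideal_power_curve(speeds)
print(dict(zip(speeds, powers)))
```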

The studied dataset includes both the theoretical power curve (which is a typical curve from the manufacturer without any outliers) and the actual curve obtained if we plot wind power versus speed. The latter usually contains many points outside the ideal theoretical shape which might be caused by turbine failure, incorrect SCADA measurements, or unscheduled maintenance.
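A simple way to flag such points is to compare the actual power with the theoretical value at the same timestamp. The sketch below assumes the short column names P and Power_curve used in this post; the helper name, the 30% relative tolerance and the 100 kW floor are arbitrary illustrative choices, and the four rows are made-up values.

```python
import pandas as pd


def flag_power_outliers(df, rel_tol=0.3, min_power=100.0):
    """Mark rows whose actual power 'P' deviates strongly from 'Power_curve'."""
    deviation = (df["P"] - df["Power_curve"]).abs()
    # the tolerance scales with the theoretical value, with a floor for low-power points
    allowed = (rel_tol * df["Power_curve"]).clip(lower=min_power)
    return deviation > allowed


# made-up rows for illustration only
demo = pd.DataFrame({
    "Ws": [4.0, 8.0, 12.0, 18.4],
    "Power_curve": [150.0, 1200.0, 3300.0, 3600.0],
    "P": [140.0, 1150.0, 3250.0, 1805.0],
})
demo["outlier"] = flag_power_outliers(demo)
print(demo)
```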

Now we will create an image that displays both types of the power curve. First, without any additional items except a legend:

import pandas as pd
import matplotlib.pyplot as plt

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)

# build the plot
plt.scatter(data['Ws'], data['P'], color='steelblue', marker='+', label='actual')
plt.scatter(data['Ws'], data['Power_curve'], color='black', label='theoretical')
plt.xlabel('Wind Speed')
plt.ylabel('Power')
plt.legend(loc='best')

# save the figure
plt.savefig('image4.png', dpi=600, bbox_inches='tight')
plt.show()

A 'silent' chart of the wind power curve. Image by Author.

As you can see, the graph needs an explanation, since it does not contain any additional details.

But what if we add lines to highlight the three main areas of the graph, with the cut-in, nominal and cut-out speeds marked, as well as a note with an arrow pointing to one of the outliers?

Let's check what the graph will look like in this case:

import pandas as pd
import matplotlib.pyplot as plt

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)

# build the plot
plt.scatter(data['Ws'], data['P'], color='steelblue', marker='+', label='actual')
plt.scatter(data['Ws'], data['Power_curve'], color='black', label='theoretical')

# add vertical lines, text notes and arrow
plt.vlines(x=3.05, ymin=10, ymax=350, lw=3, color='black')
plt.text(1.1, 355, r"cut-in", fontsize=15)
plt.vlines(x=12.5, ymin=3000, ymax=3500, lw=3, color='black')
plt.text(13.5, 2850, r"nominal", fontsize=15)
plt.vlines(x=24.5, ymin=3080, ymax=3550, lw=3, color='black')
plt.text(21.5, 2900, r"cut-out", fontsize=15)
plt.annotate('outlier!', xy=(18.4,1805), xytext=(21.5,2050),
            arrowprops={'color':'red'})

plt.xlabel('Wind Speed')
plt.ylabel('Power')
plt.legend(loc='best')

# save the figure
plt.savefig('image4_2.png', dpi=600, bbox_inches='tight')
plt.show()

A 'talkative' chart of the wind power curve. Image by Author.
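The x-positions of the vertical lines in the snippet above are hard-coded. As a sketch of how they might be read off the data instead, the hypothetical helper below estimates the cut-in and nominal speeds from the theoretical curve column. The 99% threshold is an assumption, and the small data frame holds made-up values.

```python
import pandas as pd


def curve_landmarks(df, rated_fraction=0.99):
    """Estimate (cut-in speed, nominal speed) from the 'Ws' and 'Power_curve' columns."""
    curve = df[["Ws", "Power_curve"]]
    # cut-in: the lowest wind speed at which the theoretical power is positive
    cut_in = curve.loc[curve["Power_curve"] > 0, "Ws"].min()
    # nominal: the lowest wind speed at which (almost) the maximum power is reached
    rated_power = curve["Power_curve"].max()
    nominal = curve.loc[curve["Power_curve"] >= rated_fraction * rated_power, "Ws"].min()
    return cut_in, nominal


# made-up rows for illustration only
demo = pd.DataFrame({
    "Ws": [1.0, 2.0, 3.1, 6.0, 10.0, 12.5, 15.0, 20.0],
    "Power_curve": [0.0, 0.0, 20.0, 500.0, 2500.0, 3600.0, 3600.0, 3600.0],
})
cut_in, nominal = curve_landmarks(demo)
print(cut_in, nominal)
```

The two values could then be passed as the x argument of plt.vlines instead of fixed numbers.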

Tip 3: Always find a faster way of making the same stuff

When analysing wind data, we often want to have comprehensive information about the potential of wind energy. Therefore, in addition to the dynamics of wind energy, it is necessary to have a graph showing how the wind speed depends on the wind direction.

To illustrate the changes in wind power, the following code can be used:

import pandas as pd
import matplotlib.pyplot as plt

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)

# resample 10-min data into hourly time measurements
data['Date/Time'] = pd.to_datetime(data['Date/Time'])
fig = plt.figure(figsize=(10,8))
group_data = (data.set_index('Date/Time')).resample('h')['P'].sum()

# plot wind power dynamics
group_data.plot(kind='line')
plt.ylabel('Power')
plt.xlabel('Date/Time')
plt.title('Power generation (resampled to 1 hour)')

# save the figure
plt.savefig('wind_power.png', dpi=600, bbox_inches='tight')
plt.show()

Below is the resulting plot:

Dynamics of the wind power. Image by Author.

As one might notice, the profile of wind power dynamics has quite a complex, irregular shape.
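A side note on the resampling step: sum() adds up the six 10-minute readings that fall into each hour, while mean() would give the average power over that hour. Which aggregate is appropriate depends on what the chart should show. A tiny sketch on made-up values illustrates the difference:

```python
import pandas as pd

# two hours of made-up 10-minute readings: 600 kW in the first hour, 1200 kW in the second
idx = pd.date_range("2018-01-01 00:00", periods=12, freq="10min")
demo = pd.DataFrame({"P": [600.0] * 6 + [1200.0] * 6}, index=idx)

hourly_sum = demo["P"].resample("h").sum()    # six readings added up per hour
hourly_mean = demo["P"].resample("h").mean()  # average power within each hour

print(pd.DataFrame({"sum": hourly_sum, "mean": hourly_mean}))
```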

A windrose, or a polar rose plot, is a special diagram for representing the distribution of meteorological data, typically wind speeds by direction [3]. There is a simple module, windrose, for the matplotlib library, which makes it easy to build this sort of visualization, e.g.:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from windrose import WindroseAxes

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)
wd  = data['Wa']
ws = data['Ws']

# plot normalized wind rose in a form of a stacked histogram
ax = WindroseAxes.from_ax()
ax.bar(wd, ws, normed=True, opening=0.8, edgecolor='white')
ax.set_legend()

# save the figure
plt.savefig('windrose.png', dpi = 600, bbox_inches = 'tight')
plt.show()

A windrose obtained based on the available data. Image by Author.

Looking at the wind rose map, one can notice that there are two main wind directions – north-east and south-west.
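The visual impression can be cross-checked numerically by counting how often the wind blows from each compass sector. The sketch below assumes the usual meteorological convention (0° is north, angles grow clockwise); sector_shares is a hypothetical helper and the angles are made-up values.

```python
import numpy as np
import pandas as pd

SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def sector_shares(directions_deg):
    """Share of observations per 45-degree compass sector, largest first."""
    deg = np.asarray(directions_deg, dtype=float)
    # shift by half a sector so that 'N' covers 337.5° to 22.5°
    idx = (((deg + 22.5) % 360) // 45).astype(int)
    return pd.Series(np.array(SECTORS)[idx]).value_counts(normalize=True)


# made-up wind directions in degrees
demo_wd = [40, 50, 60, 45, 200, 220, 230, 350, 10]
shares = sector_shares(demo_wd)
print(shares)
```

With the real data, the call would be sector_shares(data['Wa']).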

But how can we merge these two images into a single one? The most obvious option is to use add_subplot. However, due to the peculiarities of the windrose library, this is not a straightforward task:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from windrose import WindroseAxes

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)
data['Date/Time'] = pd.to_datetime(data['Date/Time'])

fig = plt.figure(figsize=(10,8))

# plot both plots as subplots
ax1 = fig.add_subplot(211)
group_data = (data.set_index('Date/Time')).resample('h')['P'].sum()
group_data.plot(kind='line', ax=ax1)
ax1.set_ylabel('Power')
ax1.set_xlabel('Date/Time')
ax1.set_title('Power generation (resampled to 1 hour)')

# the 'windrose' projection is registered when windrose is imported
ax2 = fig.add_subplot(212, projection='windrose')

wd = data['Wa']
ws = data['Ws']

# draw directly on ax2; WindroseAxes.from_ax() would open a separate figure
ax2.bar(wd, ws, normed=True, opening=0.8, edgecolor='white')
ax2.set_legend()

# save the figure
plt.savefig('image5.png', dpi=600, bbox_inches='tight')
plt.show()

In this case, the result looks like this:

A single image with wind power dynamics and wind rose. Image by Author.

The major downside here is that the two subplots differ in size, and because of that we have a lot of white empty space around the windrose chart.

To make things easier, I recommend taking a different approach, using the Python Imaging Library (PIL) [4] with just 11 (!) lines of code:

import numpy as np
import PIL
from PIL import Image

# list images that need to be merged
list_im = ['wind_power.png','windrose.png']
imgs = [PIL.Image.open(i) for i in list_im]

# resize all images to match the smallest
min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1]

# for a vertical stacking - we use vstack (it needs a list, not a generator)
imgs_comb = np.vstack([np.asarray(i.resize(min_shape)) for i in imgs])
imgs_comb = PIL.Image.fromarray(imgs_comb)

# save the figure
imgs_comb.save('image5_2.png', dpi=(600,600))

Here the output looks a bit prettier, because the two images have the same size: the code picks the smallest one and rescales the others to match it:

A single image with wind power dynamics and wind rose obtained by using PIL. Image by Author.

By the way, while working with PIL one can use horizontal stacking as well. For instance, let's compare and contrast the 'silent' and the 'talkative' power curve charts:

import numpy as np
import PIL
from PIL import Image

list_im = ['image4.png','image4_2.png']
imgs = [PIL.Image.open(i) for i in list_im]

# pick the image which is the smallest, and resize the others to match it (can be arbitrary image shape here)
min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1]

imgs_comb = np.hstack([np.asarray(i.resize(min_shape)) for i in imgs])
### save that beautiful picture
imgs_comb = PIL.Image.fromarray(imgs_comb)
imgs_comb.save('image4_merged.png', dpi=(600,600))

Compare & contrast two charts with power curves. Image by Author.
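Note that resizing every image to the exact shape of the smallest one can distort images whose proportions differ. As a hedged alternative, the sketch below stacks images with PIL alone and keeps each image's aspect ratio. stack_images is a hypothetical helper, and the two solid-colour images merely stand in for the saved charts.

```python
from PIL import Image


def stack_images(images, direction="vertical"):
    """Stack PIL images along one direction, scaling each to the smallest shared edge."""
    vertical = direction == "vertical"
    # the shared edge is the width for vertical stacking and the height otherwise
    edge = min(im.width if vertical else im.height for im in images)
    scaled = []
    for im in images:
        factor = edge / (im.width if vertical else im.height)
        size = (round(im.width * factor), round(im.height * factor))
        scaled.append(im.convert("RGB").resize(size))
    total = sum(im.height if vertical else im.width for im in scaled)
    canvas = Image.new("RGB", (edge, total) if vertical else (total, edge), "white")
    offset = 0
    for im in scaled:
        # paste each image right after the previous one along the stacking direction
        canvas.paste(im, (0, offset) if vertical else (offset, 0))
        offset += im.height if vertical else im.width
    return canvas


# stand-in images (assumption); in practice use Image.open('wind_power.png') etc.
a = Image.new("RGB", (400, 200), "steelblue")
b = Image.new("RGB", (200, 200), "white")
tall = stack_images([a, b], "vertical")
wide = stack_images([a, b], "horizontal")
print(tall.size, wide.size)
```

The result is a regular PIL image, so it can be saved with .save(...) exactly as in the snippets above.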

Conclusion

In this post I shared with you three tips on how to make the EDA process easier. I hope you found this advice useful and will start applying it to your own data tasks, too.

These tips perfectly match the formula that I always try to apply while doing the EDA: customize → itemize → optimize.

Well, you may ask, why on earth does this matter? I can say that actually it matters, because:

It is very important to customize your charts to the particular needs that you face right now. For instance, instead of creating tons of infographics, think about how you can combine several of them into one, as we did when creating the summary matrix, which combines the strengths of both the scatterplot and the correlation charts.
All of your charts should speak for themselves. Thus, you need to know how to itemize the important details on the chart to make it informative and readable. Compare how big the difference is between the 'silent' and the 'talkative' power curves.
And finally, every data specialist should learn how to optimize the EDA process to make things faster (and life easier). If you have to merge two images into one, you do not necessarily have to use the add_subplot option every time.

What else? I can definitely say that the EDA is a very creative and interesting step in working with data (not to mention that it is also super important).

Let your infographics shine like diamonds, and don't forget to enjoy the process!