Statistically Confirm Your Benchmarks: Comparing Pandas and Polars with 1 Million Rows of Data
Comparing benchmarking scores with the Independent samples t-test and Welch's t-test using Python

Benchmarking is one of my favorite kinds of content on the Internet. Not only are the results exciting to see, but they also help me compare things without wasting time and money. For example, articles and videos that compare GPUs can guide which laptop I should buy next. In coding, benchmarking helps me see which methods run faster.
However, be careful with everything you see on the Internet, because you may get quite different results yourself. Why?
I'm not saying that the published scores are not real. Those tests are presumably conducted under a standardized and trustworthy process. The problem is not only that the tests are run on other devices; even with the same specs, the outcomes may differ, since benchmarks can vary due to various factors (ref1, ref2).

For some hardware and software testing, it is hard to run your own experiments when resources are limited. For coding, however, it is usually easy to experiment on your own device. When you can conduct the test yourself, a simple solution is applying standard statistical methods to confirm the scores.
This article will guide you step by step through comparing benchmarks using the Independent samples t-test, Welch's t-test, and Python. The main purpose is to explain how these statistical tests work. The code and methods can also be applied to testing in other scenarios.
Let's get started…
Independent samples t-test and Welch's t-test formulas
Firstly, we are going to have a look at the formulas of the Independent samples t-test and Welch's t-test.
Independent samples t-test
t = (x̄₁ − x̄₂ − d₀) / (sₚ√(1/n₁ + 1/n₂))
where
sₚ = √(((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2))
The variables in the Independent samples t-test formula can be explained as follows:
- t = the test statistic, used to test whether the difference between two groups is statistically significant or not
- x̄₁, x̄₂ = means of the first and the second group
- d₀ = hypothesized difference between the population means
- sₚ = pooled standard deviation
- s₁, s₂ = standard deviations of the first and the second group
- n₁, n₂ = sizes of the first and the second group
Simply put, the Independent samples t-test is a statistical method for comparing the means of two unrelated groups. Thus, it can be applied to check whether the difference between two benchmark groups is statistically significant or not.
However, the Independent samples t-test requires the two unrelated groups to have equal variance. When we have to deal with unequal variances, Welch's t-test can be used instead.
Welch's t-test
t = (x̄₁ − x̄₂ − d₀) / √(s₁²/n₁ + s₂²/n₂)
From the formulas, we can notice the difference: while the Independent samples t-test uses the pooled variance (sₚ²), Welch's t-test uses each sample group's variance (s₁² and s₂²). This makes the former suitable for comparing means between two data groups with equal variance, and the latter suitable for comparing means between two data groups with unequal variance.
Another difference is the degrees of freedom. While the Independent samples t-test's degrees of freedom formula is (n₁ + n₂) − 2, Welch's t-test obtains its degrees of freedom from the Welch–Satterthwaite equation, shown below.
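For reference, the Welch–Satterthwaite equation approximates the degrees of freedom (ν) from the two sample variances and sample sizes:
ν ≈ (s₁²/n₁ + s₂²/n₂)² / ((s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1))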
Both the Independent samples t-test and Welch's t-test share a similar null hypothesis: the difference between the two population means equals d₀. When d₀ equals 0, the null hypothesis is that the two data groups have identical means.
Assumptions
In statistical analysis, the input data must satisfy certain assumptions. The main purpose is to reduce unacceptable errors in the process. For the Independent samples t-test and Welch's t-test, there are four assumptions that we should satisfy before conducting the test:
- 1. Independence of observations: each observation should belong to only one group.
- 2. No outliers: the data should contain no data points that differ significantly from the other data points.
- 3. Normality: each data group should be approximately normally distributed.
- 4. Homogeneity of variances: the variances of the data groups should be equal. As previously explained, when this assumption does not hold, Welch's t-test can be used instead.
Case Study: Pandas VS Polars with a 1-million-row dataset
To illustrate how to apply the statistical tests, I have come up with an idea to perform an actual test. This article will compare which of the Pandas and Polars libraries works faster. The operations tested consist of writing and reading CSV files and sorting a DataFrame.
To make the test interesting, the dataset used is a randomly generated 1-million-row dataset. All of this is done on my laptop, which has an Intel Core i5-1035G1, 24 GB RAM, and a 500 GB SSD.
Firstly, we will create a DataFrame with each of the Pandas and Polars libraries. After that, a set of equivalent functions, writing and reading CSV files and sorting the DataFrame, will be run multiple times. In each iteration, the execution time will be measured and collected.
Then, the collected time data will be checked against the four assumptions. Lastly, if the data satisfy all the requirements, the differences between the average times will be tested using the Independent samples t-test and Welch's t-test.
Import libraries
Besides the Pandas and Polars libraries, we will also import other useful libraries to facilitate the statistical test:
- Matplotlib and Seaborn for data visualization
- Time and Tqdm for measuring execution time
- SciPy and Statsmodels for statistical calculation
import numpy as np
import pandas as pd
import polars as pl
import time
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from scipy import stats
from statsmodels.stats.weightstats import ttest_ind
%matplotlib inline
Get data
We will randomly generate a 1,000,000 × 3 array using numpy.random.uniform.
x = np.random.uniform(low=0.01, high=100, size=(1000000,3))
x

Create two DataFrames, one Pandas DataFrame and one Polars DataFrame, from the array.
# Pandas DataFrame
df_pd = pd.DataFrame(x)
df_pd.head()

# Polars DataFrame
df_pl = pl.DataFrame(x)
df_pl.head()

Now that we have the DataFrames, let's continue with collecting the execution times of each library's functions. In total, there are three operations that we are going to work with:
- Write CSV file
- Read CSV file
- Sort DataFrame
Write CSV file
The first operation is writing a CSV file using Pandas.to_csv and Polars.write_csv. To have enough data points for statistical testing, the writing function is iterated 50 times, as sketched in the code below. If you follow this tutorial, please be aware that each iteration writes new files onto your computer.
While running the code, the elapsed time of each iteration is recorded into two lists, one for the results from the Pandas library and the other for the Polars library.
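Here is a minimal sketch of the timing loop. The list names (pd_wr, pl_wr) and the file names are my assumptions, chosen to match the variable names that appear later in this article; the original code may differ.
# Hypothetical names: pd_wr / pl_wr collect the per-iteration write times
pd_wr, pl_wr = [], []
for i in tqdm(range(50)):
    start = time.time()
    df_pd.to_csv(f'data_pd_{i}.csv', index=False)  # Pandas write
    pd_wr.append(time.time() - start)
    start = time.time()
    df_pl.write_csv(f'data_pl_{i}.csv')  # Polars write
    pl_wr.append(time.time() - start)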
Use Boxplot to show the data distribution from the collected time data.
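The plotting can be sketched as below, assuming the pd_wr and pl_wr lists from the previous step:
# Horizontal box plots comparing the two timing distributions
fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(data=[pd_wr, pl_wr], orient='h', ax=ax)
ax.set_yticklabels(['Pandas', 'Polars'])
ax.set_xlabel('Write CSV time (seconds)')
plt.show()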

It can be seen that the Polars library works considerably faster than the Pandas library. Even though this may seem obvious, confirming the results using statistical methods will tell us whether the difference is statistically significant or not.
The Pandas results contain outliers that we need to remove to satisfy the assumptions. We will talk about how to remove them after collecting the data from the other two operations.
Read CSV file
For the second operation, we will compare Pandas.read_csv and Polars.read_csv for reading CSV files. The files read are the 1-million-row CSV files created by the write functions above. The execution times are collected with the same process as explained for writing the CSV files.
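A sketch of the reading loop, again with assumed list names (pd_rd, pl_rd) matching the variables used later:
# Hypothetical names: pd_rd / pl_rd collect the per-iteration read times
pd_rd, pl_rd = [], []
for i in tqdm(range(50)):
    start = time.time()
    pd.read_csv(f'data_pd_{i}.csv')  # Pandas read
    pd_rd.append(time.time() - start)
    start = time.time()
    pl.read_csv(f'data_pl_{i}.csv')  # Polars read
    pl_rd.append(time.time() - start)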
Show the collected time data distributions using Box plots.

Once again, the Polars library beats the Pandas library in reading CSV files. The obtained result will be statistically confirmed using the Independent samples t-test or Welch's t-test later.
Sort DataFrame
The last operation for comparing the two libraries is sorting the 1-million-row DataFrame. The Pandas.sort_values and Polars.sort functions are compared.
With the same concept, each function is iterated 50 times, as sketched below. While running the code, the elapsed times of the iterations are collected into two lists.
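A sketch of the sorting loop; sorting by the first column is my assumption, as are the list names (pd_st, pl_st):
# Hypothetical names: pd_st / pl_st collect the per-iteration sort times
pd_st, pl_st = [], []
for _ in tqdm(range(50)):
    start = time.time()
    df_pd.sort_values(by=0)  # Pandas: sort by the first column
    pd_st.append(time.time() - start)
    start = time.time()
    df_pl.sort('column_0')  # Polars: sort by the first column
    pl_st.append(time.time() - start)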
Plot the collected time data distributions using Box plots.

For the third operation, the Polars library still beats the Pandas library. Both data distributions show outliers that need to be eliminated later. A difference that can be noticed in the last Box plots is that the two distributions look more similar to each other. However, we still need a statistical test before we can assume equal variance. This will be explained in the following section.
Now that we have the time data collected, let's check whether they satisfy the assumptions for the Independent samples t-test or Welch's t-test.
Checking the assumptions
1. Independence of observations
Since the lists of execution times are collected separately for the Pandas and Polars libraries, we have already satisfied the first assumption, independence of observations.
2. No outliers
Outliers are data points that lie more than 1.5 × IQR below Q1 (the first quartile) or more than 1.5 × IQR above Q3 (the third quartile), where the interquartile range is IQR = Q3 − Q1.
The numpy.percentile function can be applied to get the Q1 and Q3 values. Let's define a function for removing the outliers, as sketched in the code below.
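A minimal sketch of such a function; the name remove_outliers is hypothetical, while the output names (pd_wr2, pl_wr2, and so on) match the variables used later in this article:
def remove_outliers(data):
    # Keep only the points within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in data if lower <= v <= upper]
pd_wr2, pl_wr2 = remove_outliers(pd_wr), remove_outliers(pl_wr)
pd_rd2, pl_rd2 = remove_outliers(pd_rd), remove_outliers(pl_rd)
pd_st2, pl_st2 = remove_outliers(pd_st), remove_outliers(pl_st)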
Apply the function to remove the outliers and use Box plots to show the results. In the plots below, Pandas and Polars are the distributions after getting rid of the outliers.



As you can see, the updated time data lists contain no outliers. Thus, our execution time data satisfy the second assumption of having no outliers.
3. Normality
According to the Central limit theorem, if the collected sample is large enough (n >= 30), the distribution of the sample mean can be assumed to be approximately normal. To confirm that we have enough data points, let's check the sample sizes.
print('Pandas write CSV sample size: ', len(pd_wr2))
print('Polars write CSV sample size: ', len(pl_wr2))
print('Pandas read CSV sample size: ', len(pd_rd2))
print('Polars read CSV sample size: ', len(pl_rd2))
print('Pandas sort DataFrame sample size: ', len(pd_st2))
print('Polars sort DataFrame sample size: ', len(pl_st2))

After checking, the lowest sample size we have got is 44 and the highest is 49. Thus, we have satisfied the third assumption.
In case you encounter a test that takes too long to repeat, or the number of collected data points is quite low (n < 30), normality can be tested using the Shapiro-Wilk test instead. Its null hypothesis is that the data were drawn from a normal distribution.
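As a quick sketch, the Shapiro-Wilk test is available as scipy.stats.shapiro; checking the Pandas write-time sample here is only an example:
# A p-value greater than 0.05 means we fail to reject normality
print('Shapiro-Wilk p-value:', stats.shapiro(pd_wr2)[1])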
4. Homogeneity of variances
For the last assumption, we can use Levene's test to check whether the variances of the data groups are equal or not. Levene's test has the null hypothesis that all input samples are from populations with equal variances.
Levene's test can be easily performed using the stats.levene function from the SciPy library, as shown in the code below.
print("Write CSV Levene's test p-value:", stats.levene(pd_wr2,pl_wr2)[1])
print("Read CSV Levene's test p-value:", stats.levene(pd_rd2,pl_rd2)[1])
print("Sort DataFrame Levene's test p-value:", stats.levene(pd_st2,pl_st2)[1])

The p-value tells us whether we should reject the null hypothesis or not. The null hypothesis is rejected when the p-value is less than 0.05. On the contrary, we fail to reject the null hypothesis when the p-value is greater than 0.05.
From the Levene's test results, we can reject the null hypothesis that the samples for writing and reading CSV files have equal variances, since those p-values are lower than 0.05. However, we fail to reject the null hypothesis that the samples for sorting DataFrames have equal variances, since that p-value is higher than 0.05.
If we can assume equal variances, we can use the Independent samples t-test; if not, Welch's t-test can be used instead. Now that all the assumptions are satisfied, let's start the test.
Independent samples t-test and Welch's t-test
Here comes the statistical testing part. Let the tests begin!!

Pandas VS. Polars write CSV file
From the previously explained formula, we need to set up the hypothesized difference of population means (d₀) for testing. First, we will calculate the mean of each sample. Then, the difference between the mean values will be used for testing.
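A sketch of this step for the write-time samples; the variable name d0_wr is my assumption:
# Difference between the mean write times (Pandas minus Polars)
d0_wr = np.mean(pd_wr2) - np.mean(pl_wr2)
print('Mean write time difference:', d0_wr)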

For the first test, we can set the null hypothesis as: the mean execution time difference between the Polars and Pandas functions for writing a 1-million-row CSV file is 3.684 seconds on my laptop.
For testing, we will use the ttest_ind function from the Statsmodels library. Since the sample data have unequal variances, we need to perform Welch's t-test by setting the usevar parameter to 'unequal'. The value of d₀ is passed as the value parameter, and we evaluate the result at the 0.05 significance level (a 95 percent confidence level).
The ttest_ind function returns both the t-statistic and the p-value. They can be converted into each other using a t-table. This article will mainly use the p-value, since it makes it easier to see whether we can reject the null hypothesis or not.
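A sketch of the call, reusing the hypothetical d0_wr from above:
# Welch's t-test: H0 is that mean(pd_wr2) - mean(pl_wr2) equals d0_wr
t_stat, p_value, dof = ttest_ind(pd_wr2, pl_wr2, value=d0_wr, usevar='unequal')
print('p-value:', p_value)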

From the result, the p-value of 0.953 makes us fail to reject the null hypothesis. Thus, it can be said that the mean execution time difference between the Polars and Pandas functions for writing a 1-million-row CSV file is 3.684 seconds on my laptop, with a 95 percent confidence level.
Pandas VS. Polars read CSV file
Let's apply the same concept for the second test.

From the results, the null hypothesis for the second test is: the mean execution time difference between the Polars and Pandas libraries in reading a 1-million-row CSV file is 0.413 seconds on my laptop. The previous Levene's test result told us that the two samples have unequal variances, so we will use Welch's t-test.
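A sketch of the second test, with d0_rd as a hypothetical name for the mean read-time difference:
# Welch's t-test on the read-time samples
d0_rd = np.mean(pd_rd2) - np.mean(pl_rd2)
t_stat, p_value, dof = ttest_ind(pd_rd2, pl_rd2, value=d0_rd, usevar='unequal')
print('p-value:', p_value)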

The result shows a p-value of 0.922, so I fail to reject the null hypothesis. Thus, I can say that the mean time difference between the Polars and Pandas libraries in reading a 1-million-row CSV file is 0.413 seconds on my laptop, with a 95 percent confidence level.
Pandas VS. Polars sort DataFrame
Lastly, we will statistically confirm the mean execution time difference in sorting a DataFrame between the Pandas and Polars libraries.

In this statistical test, we need to use the Independent samples t-test, since the Levene's test result told us that the two samples have equal variances. Thus, the usevar parameter will be set to 'pooled'.
The null hypothesis can be set as: the mean execution time difference between the Polars and Pandas libraries in sorting a 1-million-row DataFrame is 0.114 seconds on my laptop.
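A sketch of the last test; d0_st is again a hypothetical variable name:
# Independent samples t-test (pooled variance) on the sort-time samples
d0_st = np.mean(pd_st2) - np.mean(pl_st2)
t_stat, p_value, dof = ttest_ind(pd_st2, pl_st2, value=d0_st, usevar='pooled')
print('p-value:', p_value)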

The p-value of 0.855 makes me retain the null hypothesis. Thus, I can be confident that the mean execution time difference between the Polars and Pandas libraries in sorting a 1-million-row DataFrame is 0.114 seconds on my laptop, with a 95 percent confidence level.
Key takeaway
For coding, there is plenty of content on the Internet comparing functions and methods. However, the results may not be the same when the tests are conducted on different equipment.
Comparing the functions and methods on your own device shows what the difference actually is on your computer. This also helps you decide which method works best for you, since each may differ on criteria such as running time, size, or power consumption.
This article has explained how to use the Independent samples t-test and Welch's t-test to statistically compare two data samples. The method and code can be applied to test other functions that you are interested in. If you have any suggestions, please feel free to leave a comment.
Thanks for reading.
PS. This article only covers statistical methods for comparing two data samples. For dealing with more than two data groups, ANOVA (analysis of variance) is a statistical method that checks whether at least one group mean differs from the others.
These are some of my data visualization articles that you may find interesting:
- Data Visualization Cheat Sheet for Basic Machine Learning Algorithms (link)
- 7 Visualizations with Python to handle Multivariate Categorical data (link)
- 8 Visualizations with Python to handle Multiple Time-Series data (link)
- 7 Visualizations with Python to Express Changes in Rank over Time (link)
- 9 Visualizations with Python that Catch More Attention than a Bar Chart (link)
- 9 Visualizations with Python to show Proportions or Percentages instead of a Pie chart (link)
References
- T-test essentials: definition, formula and calculation. Independent T-test assumptions. Datanovia. (2019, December 26). https://www.datanovia.com/en/lessons/t-test-assumptions/independent-t-test-assumptions
- Wikimedia Foundation. (2024, May 22). Welch's t-test. Wikipedia. https://en.wikipedia.org/wiki/Welch%27s_t-test
- Wikimedia Foundation. (2024a, May 17). Student's t-test. Wikipedia. https://en.wikipedia.org/wiki/Student%27s_t-test
- scipy.stats.levene – SciPy v1.13.1 Manual. (n.d.). https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html
- statsmodels.stats.weightstats.ttest_ind – statsmodels 0.14.1. (n.d.). https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ttest_ind.html