11 Useful Pandas Functionalities You Might Have Overlooked
I am quite sure that pandas does not need an introduction. In the third article of the series, we will continue to explore some of the useful functionalities of pandas that you might not have heard about.
In case you want to check out the previous parts, you can find them below:
Let's dive right into it!
A short setup
As we will be exploring the functionalities of pandas using simple toy examples, we do not need a lengthy list of libraries for this task.
Python">import pandas as pd
import numpy as np
Now we are ready to explore some of the hidden gems of pandas.
1. nth
As the name suggests, this method returns the nth row (or rows). Using Python's indexing convention, nth(0) returns the first row.
While this approach might seem very similar to the head/tail methods, we can use the nth method for a more flexible selection. For example, we can use it if we need to access the first and third rows within a group:
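A minimal sketch, assuming a hypothetical toy DataFrame with a grouping column and a numeric column x (group a deliberately starts with a missing value, which will matter in a moment):

df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [np.nan, 1, 2, 3, 4, 5],
})

# select the first and third row within each group
df.groupby("group").nth([0, 2])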
Another similarity you might notice is to the first/last methods, which return the first/last non-null value within each group. The key component of the previous sentence is non-null. Using the same grouping condition, the first method returns the following values:
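# unlike nth, first/last skip missing values within each group
df.groupby("group").first()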
The difference in behavior between the nth method and the first/last methods is clearly visible for group a in the x column. By using the first method, we are skipping the missing value in the first row.
2. pop
This simple method removes a single column from a DataFrame and stores it as a new Series object.
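As an illustration, assuming a hypothetical DataFrame X with two columns:

X = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# remove column "y" from X in place and store it as a Series
y = X.pop("y")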
If we inspect the X object after running the snippet, it will be a DataFrame with a single column called x.
3. compare
While manipulating data, we often want to identify the differences between two DataFrames. To do just that, we can use the compare method.
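One way to recreate the setup described below is with two hypothetical DataFrames that differ in a single cell:

a = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
b = a.copy()
b.loc[3, "x"] = 9  # the 4th row now differs between the two DataFrames

a.compare(b)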
The output shows the identified differences. The only difference occurs in the 4th row, in which the value in the first DataFrame (referred to as self) is 4, while in the second one (referred to as other) it is 9.
As the method only returns the differences, running the following snippet returns an empty DataFrame.
a.compare(a)
It is worth mentioning that the compare method has quite a few arguments which we can use to modify its behavior. For example, we can use the keep_shape argument to indicate that we want to keep the original shape of the DataFrames, with all identical values replaced by NaNs.
If we additionally specify the keep_equal argument, the NaN values used to mark the identical values will be replaced with the actual values.
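Continuing with the a and b DataFrames from the sketch above:

a.compare(b, keep_shape=True)                   # original shape, equal cells shown as NaN
a.compare(b, keep_shape=True, keep_equal=True)  # equal cells keep their actual values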
One thing to keep in mind is that we can only compare DataFrames that are identically labeled and shaped.
4. align
align is a useful method we can use to align two DataFrames on their axes using a specified join type. It will be easier to understand what this means with an example:
First, we have a DataFrame called X, from which we remove a single column called y. In the ML context, you might think of those as features and the target, but that is not the only possible interpretation.
Coming back to the example, we then manually filtered the X DataFrame, which caused the objects to no longer be aligned – the y Series has more observations than the X DataFrame.
To align the objects using their indices once again, we use the align method. In this case, we would like to have matching indices, so we opted for an inner join:
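A minimal sketch of the whole flow, using hypothetical data:

X = pd.DataFrame({"x": [10, 20, 30, 40, 50], "y": [0, 1, 0, 1, 0]})
y = X.pop("y")

# manual filtering: X keeps 3 rows, while y still has 5
X = X.iloc[[0, 2, 4]]

# re-align both objects on their row indices using an inner join
X, y = X.align(y, join="inner", axis=0)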
Now, both objects have 3 rows with the same indices.
5. to_markdown
Oftentimes, we want to embed a DataFrame into a report or paper. We can easily do that using the to_markdown method.
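For example (note that to_markdown relies on the tabulate package under the hood):

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(df.to_markdown())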
We can use the tablefmt argument to choose one of the 30+ available table formats. For a complete list of available formats, please refer to the documentation. In the following snippet, we generate one of the LaTeX-compatible table types:
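# "latex" is one of several LaTeX-compatible formats offered by tabulate
print(df.to_markdown(tablefmt="latex"))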
6. convert_dtypes
pandas offers many options to handle data type conversions. In my previous article, I explained how to leverage certain data types to optimize for memory usage. In that approach, we had to choose the data types manually.
Alternatively, we can use the convert_dtypes method, which converts columns to the (hopefully) best suitable data type. To manage expectations: pandas will try to figure out the best data type to represent a given column, which does not necessarily mean that it will optimize for memory usage. So if memory is the issue, we will probably still have to indicate the data types ourselves.
Coming back to the example, let's first define a DataFrame with columns containing different data types. The columns will also contain some missing values.
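A hypothetical reconstruction of such a DataFrame (columns a through c contain missing values; d is a plain boolean column):

df = pd.DataFrame({
    "a": [1, 2, np.nan],      # integers with a missing value
    "b": ["x", "y", None],    # strings with a missing value
    "c": [1.5, 2.5, np.nan],  # floats with a missing value
    "d": [True, False, True], # booleans
})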
Then, let's inspect the data types.
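df.dtypes  # float64, object, float64, bool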
Now, let's use the convert_dtypes method on our DataFrame:
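df_converted = df.convert_dtypes()
df_converted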
By inspecting the output, we can also see that pandas now supports missing values for integer columns (using pd.NA), so they no longer need to be represented as floats.
Let's again inspect the data types. We can see that they were adjusted by the convert_dtypes method:
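df_converted.dtypes  # Int64, string, Float64, boolean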
The convert_dtypes method also has useful arguments which we can use to fine-tune its behavior. For example, we can set the convert_boolean argument to False to indicate that we prefer to keep boolean columns encoded using ones and zeros instead of the True/False values.
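A sketch of that call on the original df:

# with convert_boolean=False, the boolean column should be inferred
# as a nullable integer column (ones and zeros) instead
df.convert_dtypes(convert_boolean=False)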
We can see that the d column now contains an integer representation of the boolean values.
7. ordered CategoricalDtype
Speaking of data types, we already know that we can use the category data type to store repeating string values in a more performant way, that is, to save a lot of memory.
To convert a string column into a categorical one, we can use the following snippet:
df["x"].astype("category")
To take it one step further, we can also use ordered categorical data types. For example, let's imagine that we are working for a clothing company and we have items in 5 sizes.
To encode that information while keeping the logic of the sizes, we can create a custom CategoricalDtype with the correct order of the categorical values and specify the ordered argument as True. Finally, we can convert a column to the newly created data type using the familiar astype method.
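A sketch with hypothetical size labels:

size_type = pd.CategoricalDtype(["XS", "S", "M", "L", "XL"], ordered=True)

df = pd.DataFrame({"size": ["M", "XL", "S", "XS", "L"]})
df["size"] = df["size"].astype(size_type)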
What is the benefit of using the ordered categorical data type? For example, we can easily sort by that category and obtain a meaningful ordering instead of the alphabetical order of the string names:
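df.sort_values("size")  # XS, S, M, L, XL instead of the alphabetical L, M, S, XL, XS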
Additionally, we can use the ordered categorical data type for more meaningful filtering. In the following snippet, we want to keep only sizes larger than M:
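df[df["size"] > "M"]  # keeps only the L and XL rows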
8. SparseDtype
We have already discussed using the categorical data type as a potential way of optimizing the memory used by pandas. Another way of doing that is by leveraging the sparse data type.
For example, we can have numerical columns that contain mostly zeroes. We can significantly reduce the memory consumption by treating such columns as sparse. To be precise, the majority value does not need to be zero; it could be NaN or any other value, as long as that single value is repeated frequently.
Under the hood, sparse objects are compressed in such a way that any data matching the specified value (0, NaN, or whichever value holds the majority) is omitted and not actually stored in the array, which is what saves the space.
Let's generate a massive DataFrame with the majority of values equal to zero.
df = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 5)))
df[df <= 90] = 0  # roughly 91% of the values become zero
Then, we define a helper function used for evaluating the memory consumption of a DataFrame:
def memory_usage(df):
    # total memory usage of the DataFrame in MB
    return round(df.memory_usage(deep=True).sum() / 1024 ** 2, 2)
The initial consumption is:
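memory_usage(df)  # ~381.5 MB: 10M rows x 5 columns x 8 bytes (int64)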
We can try to downcast the numeric type to the smallest available one – uint8:
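df_uint8 = df.astype("uint8")
memory_usage(df_uint8)  # ~47.7 MB, as each value now takes a single byte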
Doing so resulted in an 88% reduction of the used memory. As the next step, let's use the sparse data type:
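df_sparse = df_uint8.astype(pd.SparseDtype("uint8", 0))
memory_usage(df_sparse)  # values equal to the fill value (0) are not physically stored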
By leveraging the sparse data type, we achieved a 55% reduction compared to the previous solution (using the uint8 data type) and a 94% reduction compared to the initial DataFrame.
9. crosstab
Aggregation is a great way to condense vast amounts of data into a comprehensive and informative summary. One of the pandas functions used to aggregate data is crosstab. By default, it counts the occurrences of certain combinations of values in the columns of a DataFrame.
First, let's generate a DataFrame with some categorical and numerical values.
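A hypothetical dataset along those lines (the region, category, and sales names match the columns referenced below; the channel column is an extra assumption used for the multi-category example):

np.random.seed(42)
df = pd.DataFrame({
    "region": np.random.choice(["North", "South", "East", "West"], size=100),
    "category": np.random.choice(["A", "B", "C"], size=100),
    "channel": np.random.choice(["online", "store"], size=100),
    "sales": np.random.randint(1, 100, size=100),
})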
Using the simplest form of the crosstab function generates a table of the value counts between the region and category columns:
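pd.crosstab(df["region"], df["category"])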
We can take it a step further and aggregate over multiple categories at once.
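For example, using the hypothetical channel column as a second level:

pd.crosstab(df["region"], [df["category"], df["channel"]])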
We can also display the row-wise and column-wise totals by using the margins argument:
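pd.crosstab(df["region"], df["category"], margins=True)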
If we are not interested in the counts but the distribution, we can use the normalize argument to show the percentages:
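pd.crosstab(df["region"], df["category"], normalize=True)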
We can also combine the margins and normalize arguments so that the normalization is done across rows or columns. To do that, we should pass "index" or "columns" to the normalize argument, while margins is set to True:
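pd.crosstab(df["region"], df["category"], normalize="index", margins=True)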
Lastly, we can also use crosstab to aggregate numeric data. Using the following snippet, we calculate the average sales over the regions and categories:
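pd.crosstab(df["region"], df["category"], values=df["sales"], aggfunc="mean")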
10. swaplevel
To be honest, I am not a fan of working with a MultiIndex, as it always results in a small headache. However, swaplevel is one of the methods that can make working with a MultiIndex easier. What it does is simply swap the positions of the levels within a MultiIndex.
Let's re-use some of the code from the crosstab example and generate a simple aggregation table:
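For instance, a table indexed by region and category, reusing the hypothetical data from the crosstab section:

agg_df = pd.crosstab([df["region"], df["category"]], df["channel"])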
If we want to switch the position of the levels, we can use the swaplevel method:
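agg_df.swaplevel()  # by default, swaps the two innermost levels of the row index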
For more complex cases, we can provide the integer locations of the levels we want to swap and also the axis (rows or columns) over which we want to swap them:
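agg_df.swaplevel(0, 1, axis=0)  # equivalent here, but explicit about the levels and axis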
Naturally, this was a simplified example and we could have just swapped the indices by changing the order of the columns in the crosstab function. But I hope you get the idea.