Chaining Pandas Operations: Strengths and Limitations


The title of this article stresses the strengths and limitations of chaining Pandas operations – but to be honest, I will write about fun.

Why fun? Is it at all important when we have data to analyze?

I don't know what works for you, but for me, fun at work is important. During my 20+ years of experience in Data Science, I've found that the more enjoyment I derive from coding, the more satisfied I am with the work. And I do mean the process of pursuing the task, not just completing it. Of course, achieving results matters, probably the most. But trust me, if you dislike the tools you're using, all you'll want is to finish the job as quickly as possible. This can lead to mistakes, as you might work hastily and overlook important details in the data. And that's something you want to avoid.

I transitioned to Python from R, and analyzing data with R is a lot of fun – thanks to the dplyr syntax. I've always enjoyed it, and I still do. However, when I switched to Python, I found myself preferring it over R. I've never really enjoyed programming in R (note the distinction between analyzing data and programming), while programming in Python is a lot of fun to me. Sure, this is subjective, I'm aware of that. But that doesn't change a thing.

There was one exception, though: Python's Pandas versus R's dplyr. There was no comparison – dplyr won in every aspect of data analysis. I didn't like Pandas at all.

Have you noticed the past tense? "Was one exception." "Was no comparison." "Won in every aspect." "Didn't like Pandas."

I used the past tense because things changed – I actually do like Pandas now. I'm even starting to think that I prefer it over dplyr.

So, what changed?

Pandas pipes

Some time ago, I learned something new about Pandas – something that blew my mind. I learned how to chain Pandas operations in order to create pipelines of them.

It's high time I showed you some code; otherwise, dollars to donuts, I won't convince anyone. I will use a dataframe created in the following way:

>>> import pandas as pd
>>> def make_df():
...     return pd.DataFrame({
...         "x": [1.]*5 + [1.2]*5 + [3.]* 5 + [1, 5, 2, 3, 3],
...         "y": [20.]*5 + [21.5]*5 + [35.]* 5 + [41, 15, 22, 13, 13],
...         "group": ["a"]*5 + ["b"]*5 + ["c"]* 5 + ["d"]*5,
...     })
>>> df = make_df()

I need this function so that I can recreate the original dataframe in subsequent examples.

The pipe operator

Pipes are what make R's dplyr so powerful. Using the %>% operator¹, you can chain command after command, and such pipes can be very long – but still very readable. And it's this readability that makes the difference; that's what makes dplyr code so powerful.

Typical Pandas code uses brackets and assignments. You write line after line of such bracket assignments, and the resulting code looks rather clumsy and overwhelming. And no pipes. Let's look at a simple example:

>>> df["z"] = df["x"] * df["y"]
>>> sum(df[df["group"] == "a"]["z"])
100.0

This is typical Pandas code. You will find it in millions of notebooks and scripts across the world, especially those written by data scientists.

But this is just one style of writing Pandas code. The other style is… pipes! While dplyr uses the %>% operator for chaining, Pandas uses the dot operator (.) – the typical Python operator for accessing attributes of an object. So, everyone who knows Python already knows how to use the dot operator for pipes. You just need to know a little bit about how to do so in Pandas.

Methods for chaining

Of course, it's not only about chaining operations – it's also about the methods to chain. First of all, many pd.DataFrame methods can be chained, such as head(), sum(), isna(), and astype(). Some other methods, however, don't naturally fit into the piping approach; examples include pivot(), which reshapes the dataframe, and groupby(), which performs aggregation.
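
For instance, a quick missing-value check chains two of these methods – a minimal sketch using the df created above; .isna() returns a Boolean dataframe and .sum() counts the True values per column:

na_counts = df.isna().sum()  # a Series with one missing-value count per column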

Some methods, however, were designed for this very purpose: to be chained. They can be used in place of code that cannot be chained – that is, instead of bracket assignment. Code consisting of a chain of Pandas operations can be much more readable than a sequence of bracket assignments. Certainly not to someone who has never seen this type of syntax, but it doesn't take much time to get used to it.

The methods that I use most often in place of bracket assignments are:

  • filter(): to subset a data frame's rows or columns according to the specified index labels
  • drop(): to remove rows or columns, either by specifying label names and the corresponding axis (rows or columns), or by specifying index or column names directly
  • query(): to query the columns of a data frame with a boolean expression provided as a string
  • assign(): to assign (create) new columns to a data frame
  • squeeze(): to squeeze one-dimensional axis objects into scalars
  • pipe(): to call custom functions that take a pd.Series or pd.DataFrame and return one
  • where(): to replace values where a condition is False; admittedly, I consider this method more difficult to grasp and use than the ones mentioned above
  • apply(): to apply a function to a dataframe's columns (or rows)

Before I started chaining Pandas operations, I used only the .apply() method from this list. Be aware that all of these methods can also be used without piping, one after another, line after line, with assignments. A pipe, however, lets you chain them in a single command.
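
To give a quick taste of the methods that the later examples won't cover, here's a minimal sketch that chains .pipe(), .where() (via .assign()), and .squeeze() on the df created by make_df(); the add_ratio() helper is a hypothetical function introduced only for this illustration:

def add_ratio(d):
    # hypothetical helper for this sketch: add a y/x ratio column
    return d.assign(ratio=d["y"] / d["x"])

first_c_ratio = (
    df
    .pipe(add_ratio)                                            # call a custom function inside the chain
    .assign(ratio=lambda d: d["ratio"].where(d["x"] > 1, 0.0))  # zero out ratios where x <= 1
    .query("group == 'c'")
    .filter(["ratio"])                                          # keep a single column
    .head(1)
    .squeeze()                                                  # collapse the 1x1 frame into a scalar
)
# first_c_ratio is a plain float (about 11.67 for group "c")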

For me, the above methods constitute the most important part of the Pandas piping API. We'll see examples of how to use them soon. While these methods have proven to be most useful in my projects, there are others that can be quite useful, too; for example:

  • dropna()
  • fillna()
  • rename()
  • sort_values()
  • value_counts()

These are very useful methods, and I guess if you're a Pandas user, you've used at least some of them.

Learn the basics

If you'd like to learn everything that Pandas offers to create pipes, don't. Well, at least not at once. You can do it step by step.

For the moment, just learn the basics, and you'll most probably see – I hope so – how powerful such syntax can be. Once you know the basics and know that this is your thing, you can learn more advanced tools.

In this article, I will show you the basics only – but you will see that these basic chained operations can help you create very readable code, more readable than typical Pandas code.

Chaining Pandas operations

Okay, it's high time to see what I mean. Let's return to this code:

>>> df["z"] = df["x"] * df["y"]
>>> sum(df[df["group"] == "a"]["z"])
100.0

and reimplement it using piping methods. This time, we will use two statements: one to create the new column and keep it in df, and another to run the calculation on it. (In a moment, we'll see how to do both in a single chain.)

>>> df = make_df()
>>> df = df.assign(z=df.x * df.y)
>>> df.query("group == 'a'").z.sum()
100.0

Note that I changed df["x"] to df.x, but only to show it makes no difference – both will work in exactly the same way (if you'd like to learn the subtle differences between these two approaches, read this article).

If we only want to calculate the sum, we can do this using one chain:

>>> df = make_df()
>>> df.assign(z=df.x * df.y).query("group == 'a'").z.sum()
100.0

Remember, however, that this way the z column is created on the fly, so df itself won't contain it once the operation returns:

>>> df.columns
Index(['x', 'y', 'group'], dtype='object')

The query() method filters the dataframe based on a Boolean expression; here, we want to keep the rows in which group is "a". Then we take the z column and calculate its sum.

Let's return for the moment to the assign() method. It creates a new column. Here, we used vectorized Pandas operations, which are fast. Sometimes, however, they won't work due to the complexity of the logic we need to use to create the column. In that case, you can use a regular Python list comprehension:

>>> df.assign(z = [
...     a + b if a < 2 and b > 20 else a * b
...     for a, b in zip(df["x"], df["y"])
... ]).iloc[[0, 6, 10]]
    x     y group      z
0   1.0  20.0     a   20.0
6   1.2  21.5     b   22.7
10  3.0  35.0     c  105.0

Above, we used a simple example of a short chain of Pandas operations. The beauty of this syntax comes with long chains. Let's consider the following code:

>>> df = make_df()
>>> filtered_df = df[df["x"] > 1].copy()
>>> filtered_df["w"] = filtered_df["x"] ** 2
>>> grouped_df = filtered_df.groupby("group").agg({"y": "mean", "w": "mean"})
>>> grouped_df = grouped_df.reset_index()
>>> result = grouped_df.sort_values(by="y", ascending=False)
>>> result
  group      y      w
1     c  35.00   9.00
0     b  21.50   1.44
2     d  15.75  11.75

Note the use of the .copy() method of the pd.DataFrame object. This method is necessary because, without it, we would be working on a view of the original DataFrame rather than a separate copy. Modifying a view can lead to unexpected results, which is why Pandas issues a SettingWithCopyWarning warning when attempting to change values in a view. By making a copy, we ensure that our operations do not affect the original dataframe and avoid this warning.
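
To make this concrete, here's a minimal sketch of the warning scenario, assuming a pandas version in which copy-on-write is not enabled:

filtered_df = df[df["x"] > 1]               # pandas can't guarantee whether this is a view or a copy
filtered_df["w"] = filtered_df["x"] ** 2    # so this line triggers SettingWithCopyWarning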

Let's analyze the subsequent steps of this operation:

  1. filtered_df = df[df["x"] > 1]: Filters df to include only rows where x is greater than 1.
  2. filtered_df["w"] = filtered_df["x"] ** 2: Adds a new column w to filtered_df, which is the square of x.
  3. grouped_df = filtered_df.groupby("group").agg({"y": "mean", "w": "mean"}): Groups filtered_df by the group column and calculates the mean of y and w for each group.
  4. grouped_df = grouped_df.reset_index(): Resets the index of grouped_df to convert the group labels back into a column.
  5. result = grouped_df.sort_values(by="y", ascending=False): Sorts the resulting dataframe by the mean value of y in descending order.

Let's refactor this code into one chain of piped Pandas operations:

>>> result_chain = (
...     df[df["x"] > 1]
...     .assign(w=lambda df: df["x"] ** 2)
...     .groupby('group')
...     .agg({"y": "mean", "w": "mean"})
...     .reset_index()
...     .sort_values(by="y", ascending=False)
... )
>>> result_chain
  group      y      w
1     c  35.00   9.00
0     b  21.50   1.44
2     d  15.75  11.75

As you see, we got the very same result, which we can confirm in the following way:

>>> result_chain.equals(result)
True

Let's analyze the chained code:

  • df[df["x"] > 1]: Filters df to include only rows where x is greater than 1.
  • .assign(w=lambda df: df["x"] ** 2): Adds a new column w, which is the square of x.
  • .groupby("group"): Groups the resulting dataframe by the group column.
  • .agg({"y": "mean", "w": "mean"}): Calculates the mean of y and w for each group of the grouped dataframe.
  • .reset_index(): Resets the index to convert the group labels back into a column.
  • .sort_values(by="y", ascending=False): Sorts the resulting dataframe by the mean value of y in descending order.

As you see, both explanations are basically the same, showing that these two blocks of code do the same, using the corresponding operations one after another. The only difference lies in chaining the operations in a single pipe in the latter code.

Note that in the long pipe code, we're creating new dataframes at each step of the pipeline without modifying the original dataframe (df). This avoids the SettingWithCopyWarning because each operation in the pipeline returns a new dataframe rather than modifying an existing view – hence no need for the .copy() method.

For filtering, I often use the .query() method instead of typical Pandas filtering. It can be slower, but it's more readable to me:

>>> (
...     df
...     .query("x > 1")
...     .assign(w=lambda df: df["x"] ** 2)
...     .groupby("group")
...     .agg({"y": "mean", "w": "mean"})
...     .reset_index()
...     .sort_values(by="y", ascending=False)
... )
  group      y      w
1     c  35.00   9.00
0     b  21.50   1.44
2     d  15.75  11.75

The only difference here is the line with the .query() method. Here, the filtering is pretty simple, but sometimes, when it gets quite complicated, formulating filtering conditions that way looks just more natural – and is easier to read – than filtering using brackets.

Which of the two versions do you find more readable? To me, the piped version is more readable than the typical step-by-step approach. This is because it allows for a clear, smooth, linear flow of operations. Each step is chained together in a sequence, making it easy to follow the data's operations from start to finish. This method reduces the need for intermediate variables, which can clutter the code and make it harder to understand.

Using pipes encourages writing concise and expressive code. This not only improves readability, but also facilitates maintenance and debugging, as the entire sequence of operations is visible at a glance. When you have several temporary variables, as in the non-piped example above, the reader of the code may wonder whether some of them will be used later on or are used only in this particular place.

Simply put, the piped approach reduces unnecessary code and enhances the logical flow of data operations. This makes the code more intuitive and easier to grasp.

You can mix both types of syntax

Sometimes it's just easier to use the typical Pandas syntax. If you know how to do something using this approach, go ahead and do it. You can stop a pipe, use bracket assignment, and then use the resulting object to start a new pipe. I do this from time to time, especially when I don't have enough time to find a piping solution. However, when I do find the time later, I try to revisit and find the piped solution.
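
Here's a minimal sketch of this mixed style; the intermediate name tmp and the y_scaled column are introduced only for illustration:

tmp = (
    df
    .query("x > 1")
    .assign(w=lambda d: d["x"] ** 2)
)                                              # stop the pipe here...
tmp["y_scaled"] = tmp["y"] / tmp["y"].max()    # ...do a plain bracket assignment...
result = (
    tmp
    .groupby("group")                          # ...and start a new pipe from the result
    .agg({"w": "mean", "y_scaled": "mean"})
    .reset_index()
)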

This is actually how I'd suggest using pipes in Pandas. While pipes are smoother and more readable, breaking them occasionally isn't a problem. As you can see, I am not a radical.

Remember, though, that when you can use a chained operation, the resulting code will often be simpler and more readable than the equivalent sequence of bracket assignments.

Readability is a great virtue of any code. But it's not the only one. As you will see in the next section, piped operations can sometimes be slower than typical Pandas code.

Performance

How should you decide which Pandas syntax to use based on performance?

That depends. If it's analysis code that no one will see – use whatever syntax you prefer; the better you know it and the more comfortable you feel using it, the better. When you're working on a data product where performance is of the highest importance, with even the slightest gains making a difference (this is often the case when working on dashboards), then you definitely should use a faster approach. In any other situation, you have to weigh the pros and cons and choose the approach that'll work best under the given circumstances.

To make such decisions, you should have at least basic knowledge about the performance of the different types of syntax. Here, we'll conduct some simple benchmarking experiments, but be aware that they will cover just some basic scenarios. When working on your code, you should make thoughtful decisions. Often, this will mean profiling the code and checking whether a particular part (e.g., a chain of Pandas operations) is a bottleneck or not.
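
For a quick spot-check of a single chain, the timeit module is often enough. A minimal sketch, assuming make_df() from the beginning of the article; the names and numbers are illustrative:

from timeit import timeit

chain_time = timeit(
    'df.query("x > 1").assign(w=lambda d: d["x"] ** 2)',
    globals={"df": make_df()},   # the namespace in which the statement runs
    number=1_000,
)
print(f"{chain_time:.4f} s for 1000 runs")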

To benchmark regular Pandas code and the corresponding pipe-based code, we'll use the timeit module (you can read about it here). I will run the benchmarks on a Windows 10 machine, 32 GB of RAM, 4 physical and 8 logical cores.

See the Appendix for the benchmarking code. We will change the code_regular and code_pipe variables, which contain the two pieces of code to be benchmarked. Let's start with those contained in the Appendix, that is:

code_pipe = 'df = df.assign(xy = df["x"] * 3 / (1 + df["y"]*.3))'
code_regular = 'df["xy"] = (df["x"] * 3) / (1 + df["y"]*.3)'

This code compares the performance of the pd.DataFrame.assign() method against traditional bracket assignment (both use vectorized arithmetic). Let's see the results:

Benchmarking with df of 100 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0004555
min_time(t) = 0.0004289 
Regular code:
mean_time(t) = 0.0004164
min_time(t) = 0.0003984 

Benchmarking with df of 1000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0005324
min_time(t) = 0.000495  
Regular code:
mean_time(t) = 0.0004764
min_time(t) = 0.0004467 

Benchmarking with df of 10000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0012987
min_time(t) = 0.0011558 
Regular code:
mean_time(t) = 0.0006378
min_time(t) = 0.0005557 

Benchmarking with df of 100000 rows and repeat = 1000
Pipe:
mean_time(t) = 0.0066507
min_time(t) = 0.0060729
Regular code:
mean_time(t) = 0.001402
min_time(t) = 0.0010576

Benchmarking with df of 1000000 rows and repeat = 10
Pipe:
mean_time(t) = 0.0694944
min_time(t) = 0.0687732
Regular code:
mean_time(t) = 0.011838
min_time(t) = 0.0114591

As we can see, the .assign() method is slower than bracket assignment, and the gap grows with the size of the dataframe: for 100–1,000 rows it is roughly 10% slower, while for a million rows it is almost six times slower. This is most likely because .assign() returns a copy of the whole dataframe instead of adding the column in place.

Above, we benchmarked creating a new column. Now let's benchmark filtering columns:

code_pipe = 'df_ = df.filter(["a", "b", "c", "d", "e"])'
code_regular = 'df_ = df.loc[:, ["a", "b", "c", "d", "e"]]'

And here are the results:

Benchmarking with df of 100 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0002483
min_time(t) = 0.0002267
Regular code:
mean_time(t) = 0.0002808
min_time(t) = 0.0002459

Benchmarking with df of 1000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0002972
min_time(t) = 0.0002681
Regular code:
mean_time(t) = 0.0003176
min_time(t) = 0.0002926

Benchmarking with df of 10000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0004878
min_time(t) = 0.0003939
Regular code:
mean_time(t) = 0.0005025
min_time(t) = 0.0004806

Benchmarking with df of 100000 rows and repeat = 1000
Pipe:
mean_time(t) = 0.0013341
min_time(t) = 0.0011952
Regular code:
mean_time(t) = 0.0013245
min_time(t) = 0.0011956

Benchmarking with df of 1000000 rows and repeat = 100
Pipe:
mean_time(t) = 0.0094321
min_time(t) = 0.0090392
Regular code:
mean_time(t) = 0.0095467
min_time(t) = 0.0087992

These results are different: here the two approaches are essentially on par. For most dataframe sizes, the .filter() method was even marginally faster; only for 100,000 rows (and in the minimum time for a million rows) was it marginally slower.

Now, let's filter rows:

code_pipe = 'df_ = df.query("a >= @mean_a")'
code_regular = 'df_ = df[df["a"] >= mean_a]'

Here are the results:

Benchmarking with df of 100 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0017628
min_time(t) = 0.0016239 
Regular code:
mean_time(t) = 0.0001927
min_time(t) = 0.0001791 

Benchmarking with df of 1000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0018353
min_time(t) = 0.0017287 
Regular code:
mean_time(t) = 0.0002581
min_time(t) = 0.0002381 

Benchmarking with df of 10000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0024342
min_time(t) = 0.0021453 
Regular code:
mean_time(t) = 0.0007169
min_time(t) = 0.0004661 

Benchmarking with df of 100000 rows and repeat = 1000
Pipe:
mean_time(t) = 0.00625
min_time(t) = 0.0046697
Regular code:
mean_time(t) = 0.0035322
min_time(t) = 0.0028939

Benchmarking with df of 1000000 rows and repeat = 10
Pipe:
mean_time(t) = 0.0471684
min_time(t) = 0.041649
Regular code:
mean_time(t) = 0.0343789
min_time(t) = 0.0331603

The pattern here is clear: in all experiments, the .query() method was slower, with the relative difference largest for small dataframes (roughly nine times slower for 100 rows) and shrinking as the dataframe grows (about 40% slower for a million rows) – most likely because parsing the query string adds a roughly constant overhead to each call.

Of course, we need to remember that such benchmarks are affected by multiple factors, and in another round of experiments, we could observe slightly different results. Therefore, we shouldn't read too much into the exact ratios observed for particular dataframe sizes, as these could change from run to run. We can conclude, however, that .assign() and .query() are usually noticeably slower than the corresponding typical Pandas code, while .filter() performs on par with .loc selection. Sometimes the difference is quite visible (e.g., the piped method is several times slower), but often it is far less pronounced.

We've benchmarked the three most important piped Pandas methods, namely, .assign(), .filter(), and .query(). Indeed, when I'm creating Pandas pipes, these are the three methods I use most often.

We should treat these results with great caution. They only show that chained Pandas methods can be slower – sometimes significantly, sometimes negligibly, and sometimes not at all. Nevertheless, in real-life projects, especially when performance is at stake, you should profile your code and see how the particular set of operations performs. The general recommendation is: while you can make some general assumptions (e.g., chained Pandas operations will most likely be slightly slower), you should always profile your code to see whether a particular set of operations creates a bottleneck.

These benchmarks focused solely on execution time. This is not because I ignore memory performance, but because my experiments showed that both approaches have more or less the same memory footprints. This doesn't mean that memory footprints will always be the same in the case of long chains – or even different methods. As with execution time, it's always good to double-check memory usage. For this, you can use the pd.DataFrame.memory_usage() method, one of the Python memory-profiling packages (e.g., psutil, memory_profiler, tracemalloc), or even IPython's ipython_memory_usage extension.
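
For example, a quick check of a dataframe's footprint with the first of these options:

df = make_df()
print(df.memory_usage(deep=True))        # per-column usage in bytes (deep=True includes object columns like "group")
print(df.memory_usage(deep=True).sum())  # total in bytes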

Conclusion

We've discussed the various pros and cons of piping Pandas operations. So, should you use them instead of typical Pandas code? As often, the answer depends on the context. While making a decision, you should consider various aspects of the project, such as:

  • Type of code: Is it for analysis, proof of concept (PoC), or production?
  • Performance vs. readability: Is performance more crucial than readability?
  • Comfort and preference: How comfortable are you with each syntax? Which do you prefer?
  • Audience: Who will read the code?

These questions are interrelated, and their answers should be considered together. Here are some guidelines:

  • Analysis code: Write in the style you are most comfortable with if the code won't be shared. If it will be shared (e.g., in notebooks), consider the audience (see below). There can be exceptions, however: when you're working with big data, you may need to optimize the code for performance (time, memory, or both, depending on the situation).
  • Production code: Balance readability and performance.
  • Readability focus: Piping can provide clearer code. On the other hand, if a code reader is unfamiliar with this syntax, they may need to spend some time learning what particular piped methods do and how they work.
  • Performance focus: Prefer vectorized operations over piping. Nonetheless, remember to profile the code before making significant changes to improve performance.
  • Personal preference: Use piping if it feels natural to you; avoid it if it feels unnatural – unless you want to learn it, in which case a real-life project offers a perfect opportunity to do so.
  • Educational code: Consider the project's context to decide between simplicity, readability, and education.

Weigh the pros and cons for your project to decide whether it's piping time. Remember, however, that method chaining is integral to Python, as Pandas isn't the only framework that enables you to pipe operations. When using Python OOP, you frequently use the dot operator to chain class methods. Thus, don't treat Pandas piping as something atypical or unnatural. The truth is, chaining methods is at least as natural in Python as assigning values to temporary variables in consecutive operations.
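
A trivial non-Pandas example makes the point – chaining built-in string methods is everyday Python:

words = "  Chaining, Pandas, operations  ".strip().lower().replace(",", "").split()
# ['chaining', 'pandas', 'operations']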

Footnotes

¹ Actually, the pipe operator, %>%, originates from the [magrittr](https://magrittr.tidyverse.org/) package, which is now part of the [tidyverse](https://www.tidyverse.org/) ecosystem.

Appendix

The code for the benchmarks is below. The two pieces of code to benchmark are kept in the code_regular and code_pipe variables. The script is run from the shell with two command-line arguments:

  1. The number of rows in the dataframe.
  2. The repeat value passed to timeit.repeat (kept in the number variable in the code).

For example, this call:

> python bench.py 1000 100_000

will create a Pandas Dataframe of 1000 rows and will run the benchmark with a repeat value of 100,000.

Note that the experiments are run using high values of repeat, but the value of number is always 1. This may seem atypical, but I did this on purpose. We're benchmarking operations on mutable dataframes, and in some of the operations, we need to assign the result. Hence, to be consistent and make fair comparisons, we're assigning the result in all benchmarked operations.

Normally, we assign the result of such operations to the very same variable (in the code, it's df). Thus, if the number value is 2 or higher, df would be different in the first run of the operations compared to the next runs. However, if we use number of 1 and a high value of repeat, the df is recreated before each run: the setup code is run, and then the benchmarked code is run only once with the output of the setup code. Hence, each run leads to the very same output, which is a requirement of good benchmarks.

Here's the code:

Python">import sys
import warnings

from timeit import repeat

warnings.filterwarnings("ignore", category=DeprecationWarning)

try:
    n = int(sys.argv[1])
except IndexError:
    n = 1

try:
    number = int(sys.argv[2])
except IndexError:
    number = 1000

print(f"Benchmarking with df of {n} rows and repeat = {number}")

setup = f"""
import pandas as pd
df = pd.DataFrame({{
    letter: list(range({n}))
    for letter in 'abcdefghijklmnopqrstuwxyz'
}})
mean_a = df.a.mean()
"""

# the two code snippets to benchmark; edit these lines to benchmark other operations
code_pipe = 'df = df.assign(xy = df["x"] * 3 / (1 + df["y"]*.3))'
code_regular = 'df["xy"] = (df["x"] * 3) / (1 + df["y"]*.3)'

kwargs = dict(setup=setup, number=1, repeat=number)
t_pipe = repeat(code_pipe, **kwargs)
t_regular = repeat(code_regular, **kwargs)

def report(t: list[float], comment="", dig=7) -> None:
    # print the mean and minimum of the collected times
    def mean_time(x): return round(sum(x)/len(x), dig)
    def min_time(x): return round(min(x), dig)
    print(comment)
    print(
        f"{mean_time(t) = }"
        "\n"
        f"{min_time(t) = }"
    )

report(t_pipe, "Pipe:")
report(t_regular, "Regular code:")
