3 Silent Pandas Mistakes You Should Be Aware Of

Author: Murphy  |  Time: 2025-03-23 13:03:23

"The mistakes of the fool are known to the world, but not to himself. The mistakes of the wise man are known to himself, but not to the world." – Charles Caleb Colton

Not knowing the mistakes we make in programming does not necessarily make us a fool. However, it may result in undesired consequences.

Some mistakes shine like a diamond and can be recognized from miles away. Even if you don't notice them, compilers (or interpreters) inform us about them by raising errors.

On the other hand, there exist some "silent" mistakes that are hard to notice but have the potential to cause serious issues.

They don't raise any errors, but they make a function or operation execute differently than you think it does. Hence, the outcome changes without you noticing.

We'll go through three such issues.


You're a data analyst working at a retail company. You've been asked to analyze the results of a recently run series of promotions. One of the tasks in this analysis is calculating the total sales quantities for each promotion and the grand total.

Let's say the promotion data is stored in a DataFrame that looks like the following (definitely not this small in real life):

promotion DataFrame (image by author)

And here is the Pandas code to create this DataFrame if you'd like to follow along and do the examples on your own:

import pandas as pd

promotion = pd.DataFrame(
    {
        "promotion_code": ["A2", "A1", "A2", "B1", "A2", None, "A2", "B1", None, "A1"],
        "sales_qty": [34, 32, 26, 71, 44, 27, 64, 33, 45, 90],
        "price": [24.5, 33.1, 64.9, 52.0, 29.0, 47.5, 44.2, 25.0, 42.5, 30.0]
    }
)

Calculating the total sales quantity per promotion code is a piece of cake. You just need to use the groupby function:

promotion.groupby("promotion_code").agg(

    total_promo_sales = ("sales_qty", "sum")

)

# output         
promotion_code    total_promo_sales
            A1                  122
            A2                  168
            B1                  104

When you add these up to get the grand total, you find 394, but the correct value of the total sales quantity is 466.

promotion_sales = promotion.groupby("promotion_code").agg(
    total_promo_sales = ("sales_qty", "sum")
)

promotion_sales.total_promo_sales.sum() # output 394

promotion.sales_qty.sum() # output 466

The reason for the difference is the missing promotion code values (i.e. None). For some reason, some of the promotion code values haven't been recorded.

The groupby function ignores the missing values by default. To include them in the result, you need to set the dropna parameter to False.

promotion.groupby("promotion_code", dropna=False).agg(

    total_promo_sales = ("sales_qty", "sum")

)

# output         
promotion_code    total_promo_sales
            A1                  122
            A2                  168
            B1                  104
           NaN                   72

Our sample DataFrame has only 10 rows, so we're able to spot the missing values. However, you'll likely work with much larger DataFrames (thousands or millions of rows), which makes a visual inspection impossible.

Always keep the missing values in mind and check for them.
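A quick sanity check makes this a habit rather than a hope. Here is a minimal sketch, reusing the promotion DataFrame from above, that counts the missing codes with isna and cross-checks the grouped total against the column total:

```python
import pandas as pd

promotion = pd.DataFrame(
    {
        "promotion_code": ["A2", "A1", "A2", "B1", "A2", None, "A2", "B1", None, "A1"],
        "sales_qty": [34, 32, 26, 71, 44, 27, 64, 33, 45, 90],
        "price": [24.5, 33.1, 64.9, 52.0, 29.0, 47.5, 44.2, 25.0, 42.5, 30.0],
    }
)

# Count missing values per column before grouping
missing = promotion.isna().sum()
print(missing["promotion_code"])  # 2 rows have no promotion code

# A cheap cross-check: with dropna=False, the grouped total
# must equal the plain column total (466)
grouped = promotion.groupby("promotion_code", dropna=False)["sales_qty"].sum()
assert grouped.sum() == promotion["sales_qty"].sum()
```

If the assertion fails without dropna=False, you know rows are silently falling out of your aggregation.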


The second silent mistake we'll talk about is chained indexing.

Indexing on Pandas DataFrames is quite useful; it's mainly used for getting and setting subsets of data.

We can use row and column labels as well as their index values to access a particular set of rows and columns.

Consider our promotion DataFrame from the previous examples. Let's say we want to update the sales quantity value in the second row. Here is one way of doing this:

promotion["sales_qty"][1] = 45

We first select the sales quantity column and then select the second row whose index (and also label) is 1. This is called "chained indexing" and should be avoided.

When you execute this line of code, you'll get a SettingWithCopyWarning. The operation is done as expected (i.e. the value is updated to 45), but we should not ignore this warning.

According to the Pandas documentation, "assigning to the product of chained indexing has inherently unpredictable results." The main reason is that we can't be sure whether the indexing operation will return a view or a copy. Thus, the value we try to update may or may not be updated.

The better way of doing this operation is using the loc method, which is guaranteed to operate on the DataFrame directly.

Here is how we'd update the value in the second row of the sales quantity column:

promotion.loc[1, "sales_qty"] = 45
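To see why the loc form is safer, here is a minimal sketch with made-up data. Assigning through an intermediate selection modifies a copy, so the original DataFrame is silently left untouched, while a single loc call updates it directly (depending on your pandas version, the first assignment may or may not emit a SettingWithCopyWarning):

```python
import pandas as pd

df = pd.DataFrame({"promotion_code": ["A1", "A2"], "sales_qty": [32, 34]})

# Chained indexing through an intermediate selection modifies a copy:
subset = df[df["promotion_code"] == "A1"]
subset["sales_qty"] = 0          # df is untouched
print(df.loc[0, "sales_qty"])    # still 32

# A single .loc call operates on df directly, so the update is guaranteed:
df.loc[df["promotion_code"] == "A1", "sales_qty"] = 0
print(df.loc[0, "sales_qty"])    # 0
```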

The third silent mistake is related to a difference between the loc and iloc methods. These methods are used for selecting a subset from a DataFrame.

  • loc: select by labels of rows and columns
  • iloc: select by positions of rows and columns

By default, Pandas assigns integer values (starting from 0) as row labels. As a result, row labels and index values become the same.

Let's do a simple example on our promotion DataFrame. It's quite small but enough to demonstrate the issue I'm about to explain.

Consider a case where we need to select the first 4 rows. Here is how we'd do it with the iloc method:

promotion.iloc[:4, :]

# output
  promotion_code sales_qty price
0             A2        34  24.5
1             A1        32  33.1
2             A2        26  64.9
3             B1        71  52.0

The part before the comma determines the rows to be selected and the part after the comma is for columns (":" means all the columns).

Let's do the same operation with the loc method. Since the row labels and index values are the same, we can use the same code (just need to change iloc to loc).

promotion.loc[:4, :]

# output
  promotion_code sales_qty price
0             A2        34  24.5
1             A1        32  33.1
2             A2        26  64.9
3             B1        71  52.0
4             A2        44  29.0

Well, the output is not the same. When we use the loc method, we get an extra row.

The reason is that when using the loc method, the upper bound is inclusive, so the last row (the row with label 4) is included.

When we use the iloc method, the upper bound is exclusive so the row with index 4 is not included.

This may be a small difference, but it can definitely lead to unexpected output and has the potential to mislead your analysis.

The loc and iloc methods come in handy for many tasks but you should know the difference between them.
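The difference becomes impossible to miss once row labels stop matching positions, for example after filtering. Here is a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"sales_qty": [34, 32, 26, 71, 44]})

# Filtering keeps the original labels, so labels no longer match positions
big = df[df["sales_qty"] > 30]   # remaining labels: 0, 1, 3, 4

print(big.loc[3, "sales_qty"])   # selects by LABEL 3  -> 71
print(big.iloc[3]["sales_qty"])  # selects by POSITION 3 (fourth row) -> 44
```

On the default index the two methods happen to agree; after any operation that drops or reorders rows, they return different rows for the same number.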


Mistakes that raise an error are important, but we immediately take the necessary actions to fix them.

What's more dangerous are the ones we're not aware of. They tend to cause indirect effects and hidden failures. In this article, we've learned about three such mistakes.

Thank you for reading. Happy learning!
