7 Examples to Master Categorical Data Operations with Python Pandas

Author: Murphy  |  Time: 2025-03-23 12:07:29
(image created by author)

Categorical variables can take on a value from a limited number of values, which are usually fixed. Here are some examples of categorical variables:

  • English skill level indicator (A1, A2, B1, B2, C1, C2)
  • Blood type of a person (A, B, AB, 0)
  • Demographic information such as race and gender
  • Education level

Pandas provides a dedicated data type for categorical variables ( category or CategoricalDtype ). Although such data can also be stored with object or string data types, there are several advantages of using the category data type. We'll learn about these advantages but let's first see how to work with categorical data.


When we create a Series or DataFrame with textual data, its data type becomes object by default. To use category data type, we need to explicitly define it.

import pandas as pd

# create Series
blood_type = pd.Series(["A", "B", "AB", "0"])

print(blood_type)
# output
0     A
1     B
2    AB
3     0
dtype: object

# create Series with category data type
blood_type = pd.Series(["A", "B", "AB", "0"], dtype="category")

print(blood_type)
# output
0     A
1     B
2    AB
3     0
dtype: category
Categories (4, object): ['0', 'A', 'AB', 'B']

Although the values are the same, the data types are different as shown with dtype when you print the Series.

We'll go over 7 sets of examples to learn the following topics:

  1. Category data type in DataFrames
  2. Categories
  3. Adding and updating values
  4. Adding and removing categories
  5. Order among categories
  6. Renaming categories
  7. Advantages of using category data type

Example 1 – category data type in DataFrames

We can declare category data type when creating the Series or DataFrame as we did above. We can also convert existing data to category afterwards using the astype method.

In the code snippet below, we first create a DataFrame with two columns of object data type. Then, we change the data type of the blood_type column to category . Remember each column of a DataFrame is a Series.

# create a DataFrame with two columns
df = pd.DataFrame(
    {
        "name": ["Jane", "John", "Ashley", "Matt"],
        "blood_type": ["A", "B", "AB", "0"]
    }
)

# check the data types
df.dtypes
# output
name          object
blood_type    object
dtype: object

# convert the blood_type column to category
df["blood_type"] = df["blood_type"].astype("category")

# check the data types again
df.dtypes
# output
name            object
blood_type    category
dtype: object
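
If several columns need converting, astype also accepts a dictionary that maps column names to data types. The snippet below is a minimal sketch of that approach. It assumes a hypothetical DataFrame called people with an extra gender column, neither of which is part of the example above.

```python
import pandas as pd

# hypothetical DataFrame; the "gender" column is added only for illustration
people = pd.DataFrame(
    {
        "name": ["Jane", "John", "Ashley", "Matt"],
        "blood_type": ["A", "B", "AB", "0"],
        "gender": ["F", "M", "F", "M"],
    }
)

# convert two columns to category in a single call
people = people.astype({"blood_type": "category", "gender": "category"})

# check which columns ended up with a categorical dtype
is_categorical = {
    col: isinstance(dtype, pd.CategoricalDtype)
    for col, dtype in people.dtypes.items()
}
print(is_categorical)
```

Columns that are not listed in the dictionary, such as name, keep their original data type.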

Example 2 – categories

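A Pandas Series of category data type is defined by its categories. By default, the categories are inferred from the unique values in the Series and, when the values can be sorted, stored in sorted order (which is why BMW comes first in the output below).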

# create Series with category dtype
brands = pd.Series(["Ford", "Toyota", "BMW"], dtype="category")

print(brands)
# output
0      Ford
1    Toyota
2       BMW
dtype: category
Categories (3, object): ['BMW', 'Ford', 'Toyota']

The categories are shown when we print the brands. We can also extract them using the categories attribute available via the cat accessor.

brands.cat.categories

# output
Index(['BMW', 'Ford', 'Toyota'], dtype='object')

It returns an index of the categories.
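
Behind the scenes, each value is stored as an integer code that points to a position in this index. The sketch below uses the codes attribute of the cat accessor to show this mapping. The variable names codes and recovered are only illustrative.

```python
import pandas as pd

brands = pd.Series(["Ford", "Toyota", "BMW"], dtype="category")

# each value is stored as an integer position in the categories index
codes = brands.cat.codes
print(codes.tolist())

# look the codes up in the categories index to recover the original labels
recovered = brands.cat.categories.take(codes.to_numpy())
print(list(recovered))
```

Storing small integers instead of repeated strings is what makes the category data type compact when a column has only a few distinct values.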

We can also define categories while creating the Series, which can be done as follows:

# create Series with category data type
brands = pd.Series(
    pd.Categorical(
        ["Ford", "Toyota", "BMW"], 
        categories=["Ford", "Toyota", "BMW", "Honda"]
    )
)

print(brands)
# output
0      Ford
1    Toyota
2       BMW
dtype: category
Categories (4, object): ['Ford', 'Toyota', 'BMW', 'Honda']

The value "Honda" does not exist in the Series currently but it can be added since it's listed among the categories.
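
As a quick check of this behavior, the sketch below counts the values per category and then assigns "Honda" to an existing position. It reuses the same Series definition as above; the variable name counts is only illustrative.

```python
import pandas as pd

brands = pd.Series(
    pd.Categorical(
        ["Ford", "Toyota", "BMW"],
        categories=["Ford", "Toyota", "BMW", "Honda"],
    )
)

# unused categories are still listed in the counts, with a count of zero
counts = brands.value_counts()
print(counts)

# assigning a value that is already a category keeps the category dtype
brands[2] = "Honda"
print(brands.dtype)
```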


Example 3 – adding and updating values

To add a new value to a Series of category data type or replace an existing one, we should pick a value from the defined categories. Otherwise, Pandas either changes the data type of the Series to object (when a new item is added) or raises an error (when an existing item is replaced).

# create a Series with category data type
brands = pd.Series(["Ford", "Toyota", "BMW"], dtype="category")

print(brands)
# output
0      Ford
1    Toyota
2       BMW
dtype: category
Categories (3, object): ['BMW', 'Ford', 'Toyota']

# Add a new item of a different category
brands[3] = "Honda"

print(brands)
# output
0      Ford
1    Toyota
2       BMW
3     Honda
dtype: object

When we added the new item "Honda", which is not among the listed categories, we ended up with a Series of object data type.
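
One possible way to append a new item while keeping the category data type is sketched below. It is not the only option. It assumes we first register the new category with add_categories (covered later in this article) and then concatenate a one-item Series that shares the same data type. The name new_row is only illustrative.

```python
import pandas as pd

brands = pd.Series(["Ford", "Toyota", "BMW"], dtype="category")

# register the new category first
brands = brands.cat.add_categories("Honda")

# build the new item with the same categorical dtype, then concatenate
new_row = pd.Series(["Honda"], dtype=brands.dtype)
brands = pd.concat([brands, new_row], ignore_index=True)

print(brands.dtype)
print(brands.tolist())
```

Because both Series share identical categories, the concatenated result stays categorical instead of falling back to object.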

If we try to replace one of the existing values with a value that is not among the existing categories, Pandas raises a TypeError.

# create a Series with category data type
brands = pd.Series(["Ford", "Toyota", "BMW"], dtype="category")

# replace the third value with Honda
brands[2] = "Honda"

# output
TypeError: Cannot setitem on a Categorical with a new category (Honda), set the categories first

There are different ways of fixing this problem. For instance, we can add "Honda" as a new category before using it in the Series.

# add Honda as a category
brands = brands.cat.add_categories("Honda")

# replace the third value with Honda
brands[2] = "Honda"

print(brands)

# output
0      Ford
1    Toyota
2     Honda
dtype: category
Categories (4, object): ['BMW', 'Ford', 'Toyota', 'Honda']
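
If this pattern comes up often, the two steps can be wrapped in a small function. The sketch below is a hypothetical helper (set_categorical_value is not a Pandas function) that adds the category only when it is missing and then assigns the value.

```python
import pandas as pd


def set_categorical_value(series, label, value):
    """Hypothetical helper: return a copy with series[label] set to value.

    The category is added first if it does not exist yet.
    """
    if value not in series.cat.categories:
        series = series.cat.add_categories(value)
    series = series.copy()  # avoid modifying the caller's Series
    series[label] = value
    return series


brands = pd.Series(["Ford", "Toyota", "BMW"], dtype="category")
brands = set_categorical_value(brands, 2, "Honda")
print(brands)
```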

Example 4 – adding and removing categories

We can add multiple categories at once using a Python list.

# create Series with category data type
sizes = pd.Series(["S", "M", "L"], dtype="category")

# add two new categories
sizes = sizes.cat.add_categories(["XS", "XL"])

print(sizes)

# output
0    S
1    M
2    L
dtype: category
Categories (5, object): ['L', 'M', 'S', 'XS', 'XL']

Just like we can add new categories, it is possible to remove existing categories.

# create Series with category data type
sizes = pd.Series(["S", "M", "L", "XL", "XXL"], dtype="category")

# remove XL and XXL categories
sizes = sizes.cat.remove_categories(["XL", "XXL"])

print(sizes)

# output
0      S
1      M
2      L
3    NaN
4    NaN
dtype: category
Categories (3, object): ['L', 'M', 'S']

It is important to note that if the Series includes values that belong to a removed category, these values become missing values (NaN).
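
When the goal is only to drop categories that no value uses, the remove_unused_categories method avoids this risk. The sketch below assumes a Series whose categories include two sizes (XL and XXL) that do not appear among the values; the name trimmed is only illustrative.

```python
import pandas as pd

sizes = pd.Series(
    pd.Categorical(
        ["S", "M", "L"],
        categories=["S", "M", "L", "XL", "XXL"],
    )
)

# drop categories that no value uses; existing values are left untouched
trimmed = sizes.cat.remove_unused_categories()

print(list(trimmed.cat.categories))
print(trimmed.isna().sum())
```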

We can use the categories attribute to extract the existing categories from a Series.

# create Series with category data type
sizes = pd.Series(["S", "M", "M", "L", "L", "S"], dtype="category")

# extract categories
sizes.cat.categories 

# output
Index(['L', 'M', 'S'], dtype='object')

# extract categories as a list
list(sizes.cat.categories)

# output
['L', 'M', 'S']

Example 5 – order among categories

In some cases, there is an order among categories (e.g. S < M < L). There are different ways of enforcing such an order.

One option is to use the as_ordered method to add an order to an existing Series of categorical data.

# create Series with category data type
sizes = pd.Series(["L", "S", "XL", "M", "L", "S"], dtype="category")

# convert it to ordered
sizes = sizes.cat.as_ordered()

print(sizes)

# output
0     L
1     S
2    XL
3     M
4     L
5     S
dtype: category
Categories (4, object): ['L' < 'M' < 'S' < 'XL']

We now see an order among the categories but it's the wrong one. For string data, Pandas sorts the categories alphabetically, which is a sensible default but not what we want here. Let's fix it using the reorder_categories method.

# reorder the categories
sizes = sizes.cat.reorder_categories(["S", "M", "L", "XL"])

print(sizes)

# output
0     L
1     S
2    XL
3     M
4     L
5     S
dtype: category
Categories (4, object): ['S' < 'M' < 'L' < 'XL']

We write the categories in the desired order inside a Python list and pass it to the reorder_categories method.
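
Once the order is in place, comparisons, sorting, and min/max follow it instead of the alphabetical order. The sketch below rebuilds the same sizes Series in one step so that it runs on its own; the name larger_than_m is only illustrative.

```python
import pandas as pd

sizes = pd.Series(
    pd.Categorical(
        ["L", "S", "XL", "M", "L", "S"],
        categories=["S", "M", "L", "XL"],
        ordered=True,
    )
)

# comparisons follow the category order, not the alphabetical order
larger_than_m = sizes > "M"
print(larger_than_m.tolist())

# sorting and min/max also respect the category order
print(sizes.sort_values().tolist())
print(sizes.min(), sizes.max())
```

The same comparison on an unordered categorical Series raises a TypeError, which is one practical reason to declare the order explicitly.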

To remove the order from the categories, we can use the as_unordered method. Let's apply it to the sizes Series created in the previous example.

# convert it to unordered
sizes = sizes.cat.as_unordered()

print(sizes)

# output
0     L
1     S
2    XL
3     M
4     L
5     S
dtype: category
Categories (4, object): ['S', 'M', 'L', 'XL']

It is also possible to enforce the order while creating the Series using the ordered parameter.

# create Series with category data type
divisions = pd.Series(pd.Categorical(
    values=["C", "C", "A", "B", "A", "C", "A"], 
    categories=["C", "B", "A"], 
    ordered=True
))

print(divisions)

# output
0    C
1    C
2    A
3    B
4    A
5    C
6    A
dtype: category
Categories (3, object): ['C' < 'B' < 'A']

The ordering is determined based on the order in which we write the categories (in this example, it's C, B, A).
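
When the same ordered categories are needed in several places, one option is to define a CategoricalDtype once and reuse it. The sketch below assumes two hypothetical Series, first_league and second_league, that should share the division ordering.

```python
import pandas as pd

# define the dtype once; categories are listed from lowest to highest
division_dtype = pd.CategoricalDtype(categories=["C", "B", "A"], ordered=True)

# use it when creating a Series ...
first_league = pd.Series(["C", "A", "B"], dtype=division_dtype)

# ... or when converting an existing one
second_league = pd.Series(["B", "B", "C"]).astype(division_dtype)

print(first_league.dtype == second_league.dtype)
print(first_league.max())
```

Sharing one dtype object keeps the categories and their order consistent across columns, which helps when the data is compared or combined later.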


Example 6 – renaming categories

We can use the rename_categories method if we need to rename the categories.

In the previous example, we created a Series called "divisions" with the categories C, B, and A. Let's rename these categories.

# rename the categories
divisions = divisions.cat.rename_categories(["group C", "group B", "group A"])

print(divisions)

# output
0    group C
1    group C
2    group A
3    group B
4    group A
5    group C
6    group A
dtype: category
Categories (3, object): ['group C' < 'group B' < 'group A']

As we see in the output, renaming categories also updates the values in the Series.
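
rename_categories also accepts a dictionary, which is handy when only some of the categories need a new name. The sketch below recreates the original divisions Series and renames a single category; the label "bottom" is made up for illustration.

```python
import pandas as pd

divisions = pd.Series(
    pd.Categorical(
        ["C", "C", "A", "B", "A", "C", "A"],
        categories=["C", "B", "A"],
        ordered=True,
    )
)

# rename only "C"; categories missing from the mapping keep their names
renamed = divisions.cat.rename_categories({"C": "bottom"})

print(list(renamed.cat.categories))
print(renamed.tolist())
```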


Example 7 – advantages of using category data type

The main data structure of Pandas is DataFrame, which is a two-dimensional data structure with labeled rows and columns. Each column in a DataFrame is also a Series object. Thus, we can easily use categorical data types in a DataFrame.

In this example, we'll create a sample DataFrame and then add a new column by changing the data type of an existing column to category.

import numpy as np

# create a DataFrame with 100000 rows
cars = pd.DataFrame({
    "id": np.arange(1, 100001),
    "brand": ["Ford", "Toyota", "BMW", "Tesla"] * 25000,
    "price": np.random.randint(10000, 20000, size=100000)
})

# add a brand_categorical column
cars["brand_categorical"] = cars["brand"].astype("category")

# check the data types
cars.dtypes

# output
id                      int64
brand                  object
price                   int64
brand_categorical    category
dtype: object

The DataFrame we created looks like below. The brand and brand_categorical columns store the same data but with different data types.

The first 5 rows of the cars DataFrame (image by author)

What is the purpose of using the category data type over object or string data types when the data is the same?

The answer is memory usage. Especially if the number of distinct values is much smaller than the total number of values (low cardinality), you'll save a ton of memory by using category data type instead of object.

Let's confirm by calculating the memory usage of the columns in the cars DataFrame.

# check the memory usage
cars.memory_usage()

# output
Index                   132
id                   800000
brand                800000
price                800000
brand_categorical    100204
dtype: int64

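The memory_usage method reports the values in bytes. The category column uses about 8 times less memory than the object column. By default, memory_usage counts only the 8-byte references stored in an object column and not the strings they point to, so the real saving is even larger (pass deep=True to include the strings). This difference matters more when we work with larger datasets (e.g. millions of rows).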
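
To include the size of the strings themselves, memory_usage can be called with deep=True. The sketch below compares the two data types this way on a brand column built like the one above. It also adds a hypothetical ids Series with all-distinct values as a counterexample. Exact byte counts depend on the Pandas and Python versions, so none are shown.

```python
import pandas as pd

brand = pd.Series(["Ford", "Toyota", "BMW", "Tesla"] * 25000)
brand_categorical = brand.astype("category")

# deep=True also counts the string data, not just the references to it
object_bytes = brand.memory_usage(index=False, deep=True)
category_bytes = brand_categorical.memory_usage(index=False, deep=True)
print(object_bytes, category_bytes)

# hypothetical high-cardinality column: every value is distinct, so the
# conversion is not expected to save memory here
ids = pd.Series([f"id_{i}" for i in range(100000)])
print(ids.memory_usage(index=False, deep=True))
print(ids.astype("category").memory_usage(index=False, deep=True))
```

The second comparison is a reminder that the benefit comes from repetition: with one category per row, the categories index has to hold every string anyway.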


Final words

Category data type is less commonly used than other string-based data types. The reason might be that we usually encode string-based data before using it in a machine learning model. However, even for data cleaning and preparation, category data type offers important advantages. Thus, if a string-based variable contains only a few distinct values relative to the total number of values, I strongly recommend using category data type.

Thank you for reading. Please let me know if you have any feedback.
