Write DRY data models with partials and Pydantic

Author:Murphy  |  View: 22705  |  Time: 2025-03-23 19:50:15
Photo by Didssph on Unsplash

Introduction

Pydantic is an incredibly powerful library for data modeling and validation that should become a standard part of your data pipelines.

Part of what makes it so powerful is that it can easily accommodate nested data such as JSON.

For a quick refresher, Pydantic is a Python library that lets you define a data model in a Pythonic way, and use that model to validate data inputs, mainly through type hints.
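As a minimal sketch of that refresher, here is a two-field model. CharacterSketch and its values are hypothetical and exist only for this illustration; they are not part of the dataset used below.

```python
import pydantic


class CharacterSketch(pydantic.BaseModel):
    # Hypothetical model for illustration: the type hints drive validation
    NAME: str
    LEVEL: int


# Valid input: the numeric string "3" is coerced to the int 3
hero = CharacterSketch(NAME="Aria", LEVEL="3")

# Invalid input: "three" cannot be parsed as an int, so validation fails
try:
    CharacterSketch(NAME="Aria", LEVEL="three")
    level_error = None
except pydantic.ValidationError as exc:
    level_error = exc
```

The model is just a class with annotated attributes, and instantiating it is what triggers validation.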

Unlike standard-library type hints, which Python does not enforce at runtime, Pydantic's annotations are checked when a model is instantiated and can be customized with additional constraints.

Partials, on the other hand, let you preload a function call with specific args and kwargs, which is particularly helpful if you're going to call the same function with the same parameters many times.
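A small standard-library sketch of the idea. The describe function and the names passed to it are made up for this example.

```python
import functools


def describe(name, role, guild="Unaffiliated"):
    """Hypothetical helper that formats a one-line character summary."""
    return f"{name} the {role} ({guild})"


# Pin role and guild once, then reuse the preloaded call
describe_guild_bard = functools.partial(describe, role="Bard", guild="Silver Harp")

line_one = describe_guild_bard("Aria")
line_two = describe_guild_bard("Borin")
```

Each call only supplies what varies (the name); the pinned keyword arguments come along automatically.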

In a previous article, I talked about using Enums to define valid string inputs for validating.

We can take this further by nesting other Pydantic data models inside our Pydantic data model.

The Data

I created our sample data using various random name generators. The dataset represents characters in a Dungeons and Dragons-type game.

Let's start with an inspection of our data:

Image by author, generated with Carbon

As we can see, rather than a flat data structure, we now have data within data.
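To make "data within data" concrete, here is a hypothetical record shaped like the models defined later in this article. The field names come from those models, but every value is invented for illustration and the nested objects are abbreviated; it is not a row from the actual dataset.

```python
import json

# Hypothetical, abbreviated record: the values are invented for illustration
raw_record = """
{
  "CREATION_DATE": "2023-01-15T08:30:00",
  "NAME": "Aria",
  "GENDER": "F",
  "RACE_NESTED": {"RACE_ID": 12, "RACE": "Elf", "STR_MODIFIER": -1},
  "GUILD_NESTED": {"GUILD_ID": 104, "GUILD": "Silver Harp", "WEEKLY_DUES": 25},
  "PAY": 120
}
"""

record = json.loads(raw_record)

# Keys whose values are themselves objects are the "data within data"
nested_keys = [key for key, value in record.items() if isinstance(value, dict)]
race_name = record["RACE_NESTED"]["RACE"]
```

Each nested key is a candidate for its own Pydantic model.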

Inspecting our previous data model we can see that we now have to make some changes to accommodate the nested data structures:

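from datetime import datetime

import pydantic

# GenderEnum, RaceEnum and ClassEnum are defined in "All the code" below
class RpgCharacterModel(pydantic.BaseModel):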
    CREATION_DATE: datetime
    NAME: str = pydantic.Field(...)
    GENDER: GenderEnum
    RACE: RaceEnum = pydantic.Field(...)
    CLASS: ClassEnum = pydantic.Field(...)
    HOME: str
    GUILD: str
    PAY: int = pydantic.Field(..., ge=1, le=500)

Visualizing the problem

One tool I find helpful for exploring nested data structures is JSON Crack, which provides a fantastic visualization of JSON data:

Image downloaded from JSON Crack under the GNU General Public License v3.0

We can see that our top-level model is supported by four other models nested beneath it.

DRY defining our model:

Using the fields we can easily see in JSON Crack, we can make our first pass at the model like this:

import pydantic
class RpgRaceModelBasic(pydantic.BaseModel):
    RACE_ID: int = pydantic.Field(..., ge=10, le=99)
    RACE: RaceEnum = pydantic.Field(...)
    HP_MODIFIER_PER_LEVEL: int = pydantic.Field(..., ge=-6, le=6)
    STR_MODIFIER: int = pydantic.Field(..., ge=-6, le=6, description="Character's strength")
    CON_MODIFIER: int = pydantic.Field(..., ge=-6, le=6, description="Character's constitution")
    DEX_MODIFIER: int = pydantic.Field(..., ge=-6, le=6, description="Character's dexterity")
    INT_MODIFIER: int = pydantic.Field(..., ge=-6, le=6, description="Character's intelligence")
    WIS_MODIFIER: int = pydantic.Field(..., ge=-6, le=6, description="Character's wisdom")
    CHR_MODIFIER: int = pydantic.Field(..., ge=-6, le=6, description="Character's charisma")

pydantic.Field() lets us specify additional parameters for our model beyond type hints.

  • ... indicates that it is a required field.
  • ge indicates that the field must be greater than or equal to this value.
  • le indicates that the field must be less than or equal to this value.
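A quick sketch of those three parameters in action. ModifierSketch and fails_validation are hypothetical names used only for this illustration.

```python
import pydantic


class ModifierSketch(pydantic.BaseModel):
    # Required (...) and constrained to the inclusive range -6..6
    STR_MODIFIER: int = pydantic.Field(..., ge=-6, le=6)


def fails_validation(payload):
    """Return True when the payload is rejected by ModifierSketch."""
    try:
        ModifierSketch(**payload)
    except pydantic.ValidationError:
        return True
    return False


in_range = ModifierSketch(STR_MODIFIER=6)          # boundary value is allowed
too_high = fails_validation({"STR_MODIFIER": 7})   # violates le=6
too_low = fails_validation({"STR_MODIFIER": -7})   # violates ge=-6
missing = fails_validation({})                     # ... makes the field required
```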

However, there is also a lot of repeated code in the fields defining our attribute modifiers. Seven of our modifiers must be between -6 and 6, so if that range ever changes, we'd have to edit seven lines of code.

We can simplify our definitions using a partial function from the functools library. A partial pins parameters in place for a function, which makes it perfect for a situation like this, where we call the same function with the same arguments over and over again:

import functools
import pydantic

id_partial = functools.partial(pydantic.Field, ..., ge=10, le=99)
attribute_partial = functools.partial(pydantic.Field, ..., ge=-6, le=6)

class RpgRaceModelDry(pydantic.BaseModel):
    RACE_ID: int = id_partial()
    RACE: RaceEnum = pydantic.Field(...)
    HP_MODIFIER_PER_LEVEL: int = attribute_partial()
    STR_MODIFIER: int = attribute_partial(description="Character's strength")
    CON_MODIFIER: int = attribute_partial(description="Character's constitution")
    DEX_MODIFIER: int = attribute_partial(description="Character's dexterity")
    INT_MODIFIER: int = attribute_partial(description="Character's intelligence")
    WIS_MODIFIER: int = attribute_partial(description="Character's wisdom")
    CHR_MODIFIER: int = attribute_partial(description="Character's charisma")

We now have a single place we can make changes to our code if we ever need to change the range of the attribute modifiers. This is a much cleaner and DRYer way of writing code.

Also, notice that we can still pass other parameters to the partial function, in this case description.

Partials are also flexible and let you override keyword arguments you've already passed to them.
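To see exactly what a partial pins and what it lets you override, here is a sketch that swaps pydantic.Field for a hypothetical stand-in, field_stub, which simply echoes its arguments back. The stand-in is an assumption made for illustration so the result is easy to inspect.

```python
import functools


def field_stub(default, **constraints):
    """Hypothetical stand-in for pydantic.Field: returns what it was called with."""
    return {"default": default, **constraints}


id_partial = functools.partial(field_stub, ..., ge=10, le=99)

# A partial object exposes what it has pinned
pinned_args = id_partial.args          # positional arguments pinned up front
pinned_kwargs = id_partial.keywords    # keyword arguments pinned up front

default_call = id_partial()                           # uses everything pinned
override_call = id_partial(ge=100, le=999)            # call-time kwargs win
extended_call = id_partial(description="Race identifier")  # new kwargs are added

# Extra positional arguments are appended after the pinned ones, so repeating
# ... here would hand the stub two positional arguments
try:
    id_partial(...)
    extra_positional_ok = True
except TypeError:
    extra_positional_ok = False
```

The takeaway is that keyword arguments can be overridden at call time, while positional arguments can only be added to, never replaced.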

Finishing out the model:

We now have to define our other three models as well as a couple of enums to go with them:

import enum
import pydantic

class AttributeEnum(enum.Enum):
    STR = 'STR'
    DEX = 'DEX'
    CON = 'CON'
    INT = 'INT'
    WIS = 'WIS'
    CHR = 'CHR'

class AlignmentEnum(enum.Enum):
    LAWFUL_GOOD = 'Lawful Good'
    LAWFUL_NEUTRAL = 'Lawful Neutral'
    LAWFUL_EVIL = 'Lawful Evil'
    NEUTRAL_GOOD = 'Neutral Good'
    TRUE_NEUTRAL = 'True Neutral'
    NEUTRAL_EVIL = 'Neutral Evil'
    CHAOTIC_GOOD = 'Chaotic Good'
    CHAOTIC_NEUTRAL = 'Chaotic Neutral'
    CHAOTIC_EVIL = 'Chaotic Evil'

class RpgClassModel(pydantic.BaseModel):
    CLASS_ID: int = id_partial()
    CLASS: ClassEnum = pydantic.Field(...)
    PRIMARY_CLASS_ATTRIBUTE: AttributeEnum = pydantic.Field(...)
    SPELLCASTER: bool = pydantic.Field(...)

class RpgPolityModel(pydantic.BaseModel):
    KINGDOM_ID: int = id_partial(ge=100, le=999)
    POLITY: str = pydantic.Field(...)
    TYPE: str = pydantic.Field(...)

class RpgGuildModel(pydantic.BaseModel):
    GUILD_ID: int = id_partial(ge=100, le=999)
    GUILD: str = pydantic.Field(...)
    ALIGNMENT: AlignmentEnum
    WEEKLY_DUES: int = pydantic.Field(..., ge=10, le=100)

Pay particular attention to KINGDOM_ID and GUILD_ID. We overrode the ge and le arguments pinned in the partial, which is fine: the call still preserves the ..., which marks the field as required.

By calling the partial function on our ID columns we never have to worry about forgetting to make them required fields.
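A short check of that behavior, using a hypothetical one-field model (GuildSketch) rather than the full models above:

```python
import functools

import pydantic

id_partial = functools.partial(pydantic.Field, ..., ge=10, le=99)


class GuildSketch(pydantic.BaseModel):
    # Hypothetical model: overrides the pinned range, keeps the pinned ...
    GUILD_ID: int = id_partial(ge=100, le=999)


def is_valid(payload):
    """Return True when the payload passes GuildSketch validation."""
    try:
        GuildSketch(**payload)
    except pydantic.ValidationError:
        return False
    return True


three_digit_ok = is_valid({"GUILD_ID": 104})   # inside the overridden range
two_digit_ok = is_valid({"GUILD_ID": 50})      # inside the old range only
missing_ok = is_valid({})                      # field is still required
```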

Defining our top-level model

Now that all the supporting models have been built, we can define our top-level model, which looks like this:

from datetime import datetime

import pydantic
class RpgCharacterModel(pydantic.BaseModel):
    CREATION_DATE: datetime
    NAME: str = pydantic.Field(...)
    GENDER: GenderEnum
    RACE_NESTED: RpgRaceModelDry = pydantic.Field(...)
    CLASS_NESTED: RpgClassModel = pydantic.Field(...)
    HOME_NESTED: RpgPolityModel
    GUILD_NESTED: RpgGuildModel
    PAY: int = pydantic.Field(..., ge=1, le=500)

Focus on HOME_NESTED and GUILD_NESTED: they are declared with a bare type annotation instead of pydantic.Field(...). In Pydantic, a field with an annotation and no default is still required, so as written both must be present, and whatever is passed must conform to the nested model, including that model's own required fields.

If you want a nested field that accepts either a valid model or nothing, annotate it as Optional (for example Optional[RpgPolityModel]) and give it a default of None.
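A minimal sketch for checking which nested fields are actually required. PolitySketch, BareCharacter and OptionalCharacter are hypothetical, trimmed-down models written for this comparison. A bare annotation with no default is treated as required, while Optional[...] with a default of None accepts a missing value.

```python
from typing import Optional

import pydantic


class PolitySketch(pydantic.BaseModel):
    KINGDOM_ID: int = pydantic.Field(..., ge=100, le=999)


class BareCharacter(pydantic.BaseModel):
    # Bare annotation, no default: the nested model must be supplied
    HOME_NESTED: PolitySketch


class OptionalCharacter(pydantic.BaseModel):
    # Optional with a default of None: a valid model or nothing
    HOME_NESTED: Optional[PolitySketch] = None


def is_valid(model, payload):
    """Return True when the payload passes validation for the given model."""
    try:
        model(**payload)
    except pydantic.ValidationError:
        return False
    return True


bare_without_home = is_valid(BareCharacter, {})
optional_without_home = is_valid(OptionalCharacter, {})
optional_with_good_home = is_valid(OptionalCharacter, {"HOME_NESTED": {"KINGDOM_ID": 250}})
optional_with_bad_home = is_valid(OptionalCharacter, {"HOME_NESTED": {"KINGDOM_ID": 5}})
```

Even in the optional case, any data that is supplied still has to satisfy the nested model.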

Conclusion

Combining functools.partial() with Pydantic data models can do a lot to make your code cleaner and easier to understand, and to ensure that you're handling invalid data in a scalable way.

In our example, we only built a single level of nesting, but you can nest repeatedly to manage any gnarly JSON object you encounter.
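As a rough sketch of deeper nesting, here is a hypothetical three-level structure. RegionSketch and the REGION_NESTED field are invented for this example and do not appear in the dataset.

```python
import pydantic


class RegionSketch(pydantic.BaseModel):
    REGION: str


class PolitySketch(pydantic.BaseModel):
    POLITY: str
    REGION_NESTED: RegionSketch      # a model inside a model


class CharacterSketch(pydantic.BaseModel):
    NAME: str
    HOME_NESTED: PolitySketch        # which itself sits inside another model


character = CharacterSketch(
    NAME="Aria",
    HOME_NESTED={
        "POLITY": "Eastmarch",
        "REGION_NESTED": {"REGION": "Northern Reach"},
    },
)
region = character.HOME_NESTED.REGION_NESTED.REGION

# A problem two levels down is reported with its full path
try:
    CharacterSketch(NAME="Aria", HOME_NESTED={"POLITY": "Eastmarch", "REGION_NESTED": {}})
    error_path = None
except pydantic.ValidationError as exc:
    error_path = tuple(exc.errors()[0]["loc"])
```

The error location walks down through each nested field, which makes it much easier to find the offending value in a large document.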

Likewise, a well-built data model gives downstream consumers confidence that the data you're sending to them is exactly what they're expecting.

How do you think you can apply these techniques to your data pipelines?

About

Charles Mendelson is a Data Engineer working at PitchBook Data. If you would like to get in touch with him, the best way is on LinkedIn.

All the code:

# Standard Library imports
from datetime import datetime
import enum
import functools

# 3rd Party package imports
import pydantic

# Enums for limiting string data in our model

class GenderEnum(enum.Enum):
    M = 'M'
    F = 'F'
    NB = 'NB'

class ClassEnum(enum.Enum):
    Druid = 'Druid'
    Fighter = 'Fighter'
    Warlock = 'Warlock'
    Ranger = 'Ranger'
    Bard = 'Bard'
    Sorcerer = 'Sorcerer'
    Paladin = 'Paladin'
    Rogue = 'Rogue'
    Wizard = 'Wizard'
    Monk = 'Monk'
    Barbarian = 'Barbarian'
    Cleric = 'Cleric'

class RaceEnum(enum.Enum):
    Human = 'Human'
    Dwarf = 'Dwarf'
    Halfling = 'Halfling'
    Elf = 'Elf'
    Dragonborn = 'Dragonborn'
    Tiefling = 'Tiefling'
    Half_Orc = 'Half-Orc'
    Gnome = 'Gnome'
    Half_Elf = 'Half-Elf'

class AttributeEnum(enum.Enum):
    STR = 'STR'
    DEX = 'DEX'
    CON = 'CON'
    INT = 'INT'
    WIS = 'WIS'
    CHR = 'CHR'

class AlignmentEnum(enum.Enum):
    LAWFUL_GOOD = 'Lawful Good'
    LAWFUL_NEUTRAL = 'Lawful Neutral'
    LAWFUL_EVIL = 'Lawful Evil'
    NEUTRAL_GOOD = 'Neutral Good'
    TRUE_NEUTRAL = 'True Neutral'
    NEUTRAL_EVIL = 'Neutral Evil'
    CHAOTIC_GOOD = 'Chaotic Good'
    CHAOTIC_NEUTRAL = 'Chaotic Neutral'
    CHAOTIC_EVIL = 'Chaotic Evil'

# partial function for siloing our logic in one place:
id_partial = functools.partial(pydantic.Field, ..., ge=10, le=99)
attribute_partial = functools.partial(pydantic.Field, ..., ge=-6, le=6)

# models that make up our main model

class RpgRaceModelDry(pydantic.BaseModel):
    RACE_ID: int = id_partial()
    RACE: RaceEnum = pydantic.Field(...)
    HP_MODIFIER_PER_LEVEL: int = attribute_partial()
    STR_MODIFIER: int = attribute_partial()
    CON_MODIFIER: int = attribute_partial()
    DEX_MODIFIER: int = attribute_partial()
    INT_MODIFIER: int = attribute_partial()
    WIS_MODIFIER: int = attribute_partial()
    CHR_MODIFIER: int = attribute_partial()

class RpgClassModel(pydantic.BaseModel):
    CLASS_ID: int = id_partial()
    CLASS: ClassEnum = pydantic.Field(...)
    PRIMARY_CLASS_ATTRIBUTE: AttributeEnum = pydantic.Field(...)
    SPELLCASTER: bool = pydantic.Field(...)

class RpgPolityModel(pydantic.BaseModel):
    KINGDOM_ID: int = id_partial(ge=100, le=999)
    POLITY: str = pydantic.Field(...)
    TYPE: str = pydantic.Field(...)

class RpgGuildModel(pydantic.BaseModel):
    GUILD_ID: int = id_partial(ge=100, le=999)
    GUILD: str = pydantic.Field(...)
    ALIGNMENT: AlignmentEnum
    WEEKLY_DUES: int = pydantic.Field(..., ge=10, le=100)

# Our top level model:
class RpgCharacterModel(pydantic.BaseModel):
    CREATION_DATE: datetime
    NAME: str = pydantic.Field(...)
    GENDER: GenderEnum
    RACE_NESTED: RpgRaceModelDry = pydantic.Field(...)
    CLASS_NESTED: RpgClassModel = pydantic.Field(...)
    HOME_NESTED: RpgPolityModel
    GUILD_NESTED: RpgGuildModel
    PAY: int = pydantic.Field(..., ge=1, le=500)

# We didn't talk about this one, but from a previous article,
# this validates each row of data

def validate_data(list_o_dicts, model: type[pydantic.BaseModel], index_offset: int = 0):
    # Copy each row so adding error details doesn't mutate the caller's dictionaries
    list_of_dicts_copy = [dict(row) for row in list_o_dicts]
    # capturing our good data and our bad data
    good_data = []
    bad_data = []
    for index, row in enumerate(list_of_dicts_copy):
        try:
            model(**row)  # unpacks our dictionary into keyword arguments
            good_data.append(row)  # appends valid data to a new list of dictionaries
        except pydantic.ValidationError as e:
            # Attaches every validation error for this row to the dictionary
            row['Errors'] = e.errors()
            row['Error_row_num'] = index + index_offset
            # appends bad data to a different list of dictionaries
            bad_data.append(row)

    return (good_data, bad_data)
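A usage sketch for validate_data. The function is repeated in condensed form so the snippet runs on its own, and PaySketch plus the two rows are hypothetical stand-ins for the full character model and dataset.

```python
import pydantic


class PaySketch(pydantic.BaseModel):
    # Hypothetical two-field model standing in for RpgCharacterModel
    NAME: str
    PAY: int = pydantic.Field(..., ge=1, le=500)


def validate_data(list_o_dicts, model, index_offset=0):
    """Condensed copy of the article's helper: split rows into good and bad."""
    good_data, bad_data = [], []
    for index, row in enumerate(list_o_dicts):
        row = dict(row)  # work on a copy of each row
        try:
            model(**row)
            good_data.append(row)
        except pydantic.ValidationError as e:
            row['Errors'] = e.errors()
            row['Error_row_num'] = index + index_offset
            bad_data.append(row)
    return good_data, bad_data


rows = [
    {"NAME": "Aria", "PAY": 120},     # valid
    {"NAME": "Borin", "PAY": 9000},   # PAY is above le=500
]

# index_offset=1 reports 1-based row numbers instead of 0-based ones
good, bad = validate_data(rows, PaySketch, index_offset=1)
failed_field = tuple(bad[0]["Errors"][0]["loc"])
```

The bad rows keep their original values alongside the error details, so they can be written out for review without losing any context.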

Originally published at https://charlesmendelson.com on February 23, 2023.
