Electric Cars In the Netherlands: Exploratory Data Analysis with Python and SQLAlchemy (Part 2)

When was the first electric car registered? (Spoiler: it was much earlier than most people may think.) Which cars are more expensive, the electric Porsche or the Jaguar? Exploratory data analysis (EDA) is not only an important part of building every data pipeline, but it is also a pretty interesting process. In the first part, I analyzed the RDW (Netherlands Vehicle Authority) dataset with Python and Pandas, and one of the challenges was the large (about 10 GB) dataset size. As a workaround, I specified the list of columns that needed to be loaded in Pandas. That works, but what if the dataset is even larger and there is still not enough RAM to fit all the data in memory, or if the dataset is stored in a remote database? In this article, I will show how to make a similar analysis using SQLAlchemy. This allows the "heavy" data processing to be done in SQL, without the need to load all the data into Pandas.
Let's get started.
Loading the data
The RDW ("Rijks Dienst Wegverkeer", https://www.rdw.nl) is a Dutch organization that handles approvals and registration of motorized vehicles and driving licenses in the Netherlands. I will use the "Gekentekende voertuigen" ("Vehicles with license plates") dataset. As described in the first part, it is available under a Public Domain license and can be downloaded from opendata.rdw.nl. For data processing, I will use SQLite, a free and lightweight database engine that can easily run on any PC.
In the beginning, we need to download the CSV file and import it into SQLite. The file size is about 10 GB; it can be downloaded for free and no registration is required. To import the data, I run the "sqlite3 rdw_data.db" command and enter 3 commands:
sqlite> .mode csv
sqlite> .import Open_Data_RDW__Gekentekende_voertuigen.csv rdw_data
sqlite> .quit
Here "Open_Data_RDW__Gekentekende_voertuigen.csv" is the original CSV file, and "rdw_data" is the name of the table to be created. The import process takes some time; after that, we are done with the command line and can return to Jupyter Lab. First, let's make the needed imports and see which database columns we have:
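As an alternative to the sqlite3 command line, the same import can be done from Python. A minimal sketch (assuming the file and table names above), which streams the CSV in chunks so the 10 GB file never has to fit into RAM:

```python
import sqlite3
import pandas as pd

def csv_to_sqlite(csv_path: str, db_path: str, table: str, chunksize: int = 100_000):
    """Stream a large CSV into an SQLite table chunk by chunk."""
    with sqlite3.connect(db_path) as con:
        for chunk in pd.read_csv(csv_path, chunksize=chunksize, dtype=str):
            # Append each chunk, so only `chunksize` rows are in memory at once
            chunk.to_sql(table, con, if_exists="append", index=False)

# csv_to_sqlite("Open_Data_RDW__Gekentekende_voertuigen.csv", "rdw_data.db", "rdw_data")
```

Note that `dtype=str` mimics the command-line import, where everything lands as TEXT; the type conversion is done later in SQL anyway.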
from sqlalchemy import create_engine, select, func, text, inspect
from sqlalchemy.orm import Session
import pandas as pd

rdw_db = create_engine('sqlite:///rdw_data.db')
table_name = 'rdw_data'

insp = inspect(rdw_db)
columns = insp.get_columns(table_name)
display(pd.DataFrame(columns))
I am using a Pandas DataFrame for displaying the results because its output is much easier to read. For example, "display(columns)" will show the output like this:

At the same time, the "display(pd.DataFrame(columns))" output looks much better:

Let's check what we've got. We can see that all columns have the TEXT type, so we will need to convert the values. There are 91 columns in the database, but for the actual analysis, I need only the car's type, license plate, model name, price, and registration date. I will also use the "Number of cylinders" column as a helper to detect whether a car is electric or not. Last but not least, I am going to analyze only "personal" ("Personenauto" in Dutch) cars, not trucks or buses, so I will use this filter in the SQL query.
Let's do this conversion using SQL:
with Session(rdw_db) as session:
    session.execute(text('DROP TABLE IF EXISTS rdw_cars'))
    session.execute(text('CREATE TABLE rdw_cars("index" INTEGER PRIMARY KEY AUTOINCREMENT, '
                         '"Model" TEXT, '
                         '"Trade name" TEXT, '
                         '"License Plate" TEXT, '
                         '"Number of Cylinders" INTEGER, '
                         '"Catalog price" INTEGER, '
                         '"First registration NL" TEXT, '
                         '"Is electric" INTEGER DEFAULT 0)'))
    session.execute(text('BEGIN TRANSACTION'))
    session.execute(text('INSERT INTO rdw_cars("Model", "Trade name", "License Plate", "Number of Cylinders", "Catalog price", "First registration NL") '
                         'SELECT '
                         '"Merk", '
                         '"Handelsbenaming", '
                         '"Kenteken", '
                         '(CASE WHEN LENGTH("Aantal cilinders") > 0 THEN CAST("Aantal cilinders" as INTEGER) ELSE NULL END), '
                         '(CASE WHEN LENGTH("Catalogusprijs") > 0 THEN CAST("Catalogusprijs" as INTEGER) ELSE NULL END), '
                         'DATE(SUBSTR("Datum eerste tenaamstelling in Nederland", 1, 4) || "-" || SUBSTR("Datum eerste tenaamstelling in Nederland", 5, 2) || "-" || SUBSTR("Datum eerste tenaamstelling in Nederland", 7, 2)) '
                         ' FROM rdw_data WHERE "Voertuigsoort" = "Personenauto"'))
    session.execute(text('COMMIT'))
Here I created a new table and converted Integer and Date columns into proper formats. I replaced all empty strings with NULLs, and as a bonus for readers, I translated Dutch column names to English. I also created the "Is electric" column, which I will use later.
The initial conversion is done, and we are ready to go.
Basic analysis
In the beginning, let's see the main properties of the dataset, like data samples, dimensionality, and the number of NULL values.
Using SQL, we can get the total amount of records:
with Session(rdw_db) as session:
    q = session.execute(text('SELECT COUNT(*) FROM rdw_cars')).scalar()
    print("Cars total:", q)
In total, we have 9,487,265 cars registered in the Netherlands at the time of writing this article (readers who download the dataset later will obviously get a bigger number). The total is also equal to the number I got in the first part, where I did a similar analysis using Pandas only; it's an easy way to check that the processing is right.
Let's now see the first 5 samples in the database; this is easy to do with SQL. Here and later, I will use Pandas to display tables; it's convenient because Pandas has native SQL bindings:
with Session(rdw_db) as session:
    df = pd.read_sql_query(text("SELECT * FROM rdw_cars LIMIT 5"), con=session.connection(),
                           dtype={'Catalog price': pd.UInt32Dtype(), 'Number of Cylinders': pd.UInt32Dtype()})
    display(df.style.hide(axis="index"))
The result looks like this:

Let's check how many missing/NULL values we have in different columns. A Pandas DataFrame has a convenient "df.isna().sum()" method, but I was not able to find anything similar in SQL, so we need to list explicitly all the columns we want to check:
with Session(rdw_db) as session:
    request = ('SELECT '
               ' SUM(CASE WHEN "Model" IS NULL OR "Model" = "" THEN 1 ELSE 0 END) AS model_no_data, '
               ' SUM(CASE WHEN "Trade name" = "" THEN 1 ELSE 0 END) AS trade_name_empty, '
               ' SUM(CASE WHEN "Trade name" IS NULL THEN 1 ELSE 0 END) AS trade_name_nulls, '
               ' SUM(CASE WHEN "License Plate" IS NULL OR "License Plate" = "" THEN 1 ELSE 0 END) AS lp_no_data, '
               ' SUM(CASE WHEN "Number of Cylinders" = 0 THEN 1 ELSE 0 END) AS num_cylinders_zeros, '
               ' SUM(CASE WHEN "Number of Cylinders" IS NULL THEN 1 ELSE 0 END) AS num_cylinders_nulls, '
               ' SUM(CASE WHEN "Catalog price" = 0 THEN 1 ELSE 0 END) AS price_zeros, '
               ' SUM(CASE WHEN "Catalog price" IS NULL THEN 1 ELSE 0 END) AS price_nulls, '
               ' SUM(CASE WHEN "First registration NL" IS NULL THEN 1 ELSE 0 END) AS registration_nulls, '
               ' COUNT(*) AS total '
               'FROM rdw_cars')
    df = pd.read_sql(text(request), con=session.connection())
    display(df.style.hide(axis="index"))
Using SQL, I count the values that are NULL or empty in each column. The result looks like this:

Here we can see the total amount of cars (9,487,265). Every car has a license plate and a registration date; those fields are probably mandatory for registration. But 2,480,506 records do not have a price, 864 do not have a "trade name", and so on. And here I can see a problem: those 864 records with an empty "trade name" field do not match the 1,405 empty records I got with Pandas in the first part. Something is obviously not right; where does the difference come from? It is impossible to check 9,487,265 records manually, and the easiest way to debug the problem was to save the unique "trade name" values to a text file and compare both files with a "diff" tool. It turned out that the problem was simple but interesting: in the first part, I was using the "pd.read_csv" method to load the data. This method is "smart enough" to automatically replace "NULL", "NA", "N/A", and some other values (the full list can be found in the manual) with NULLs, and this conversion is enabled by default. In our case, the Mazda NA is a real car model, and Pandas automatically converted all "NA" names for those cars to NULL (it also reminds me of the old story about the man with the last name Null, who was "invisible for computers"). Anyway, the Mazda NA is not electric, so it does not affect the results of the first part, but it is good to keep in mind that such problems may occur.
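This pitfall is easy to reproduce and to avoid: "pd.read_csv" has the "keep_default_na" and "na_values" parameters that control this conversion. A small self-contained check with two made-up rows:

```python
import io
import pandas as pd

csv_data = 'Merk,Handelsbenaming\nMAZDA,NA\nTESLA,MODEL 3\n'

# Default behavior: the string "NA" is silently converted to NaN
df_default = pd.read_csv(io.StringIO(csv_data))
print(df_default["Handelsbenaming"].isna().sum())  # 1: the Mazda NA is "lost"

# keep_default_na=False keeps "NA" as a literal string
df_safe = pd.read_csv(io.StringIO(csv_data), keep_default_na=False)
print((df_safe["Handelsbenaming"] == "NA").sum())  # 1: the Mazda NA survives
```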
But let's return to the analysis. Using SQL, we can easily make useful requests; for example, let's see the top 10 most expensive cars in the Netherlands:
with Session(rdw_db) as session:
    df = pd.read_sql(text('SELECT "Model", "Trade name", "Catalog price", "First registration NL" FROM rdw_cars ORDER BY "Catalog price" DESC LIMIT 10'),
                     con=session.connection())
    display(df)
The result is interesting:

I was expecting to see Porsche, Mercedes, or BMW on this list, but having Peugeot or Fiat here is surprising to me, though I'm not an expert in luxury cars.
Data transform
We already did some basic analysis using SQL requests, but the purpose of this article is to analyze electric cars. To detect whether a car is electric, we need to know its manufacturer and model name. Ideally, if electric cars had "ELECTRIC" in their names, the task would be straightforward. But in real life, there is no logic in model naming at all. The "Mazda MX-30" is electric, but the "Mazda MX-5" is not. The "Kia Niro" is electric, the "Kia Sorento" is not, and so on. There is no rule here, and the easiest way is just to create a table of electric car models and use it. But first, let's check if the car models and trade names in the dataset are consistent.
Let's verify car models first, for example, let's see all PEUGEOT cars:
with Session(rdw_db) as session:
    df = pd.read_sql_query(text('SELECT "Model", COUNT(*) AS Count FROM rdw_cars WHERE "Model" LIKE "%PEUGEOT%" GROUP BY "Model" '),
                           con=session.connection())
    display(df.style.hide(axis="index"))
The result looks like this:

Almost all cars in the database have the model name "PEUGEOT", which is fine, but several cars have longer names, like "PEUGEOT BOXER". The first word is enough to identify the model, so we can easily remove the rest. This will also be more convenient for future analysis; for example, we can group cars by model and see how many Peugeot cars were sold. In the first part, I already created a method to remove the redundant words from the model name:
def model_normalize(s_val):
    """ "PEUGEOT BOXER/GLOBE-TRAVE " => "PEUGEOT" """
    if s_val and isinstance(s_val, str) and len(s_val) > 0:
        return s_val.replace("-", " ").replace("/", " ").split()[0].upper().strip()
    return None
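A couple of spot checks show what the method does (the helper is repeated here so the snippet runs standalone; the expected results are in the comments):

```python
def model_normalize(s_val):
    """ "PEUGEOT BOXER/GLOBE-TRAVE " => "PEUGEOT" """
    if s_val and isinstance(s_val, str) and len(s_val) > 0:
        return s_val.replace("-", " ").replace("/", " ").split()[0].upper().strip()
    return None

print(model_normalize("PEUGEOT BOXER/GLOBE-TRAVE "))  # PEUGEOT
print(model_normalize("Mercedes-Benz"))               # MERCEDES
print(model_normalize(None))                          # None
```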
Let's now check the next column. Car trade names in the dataset can sometimes be duplicated with models, like the "NISSAN" car in this example:

I created a method to remove these duplicates; in this example, it converts the "NISSAN MURANO" field to just "MURANO".
def name_normalize(model: str, trade_name: str):
    """ Remove duplicates and convert the name to upper case """
    if isinstance(trade_name, str) and len(trade_name) > 0:
        name = trade_name.upper().strip()
        # Remove duplicates from model and trade name:
        # ("TESLA", "TESLA MODEL 3") => ("TESLA", "MODEL 3")
        if name.split()[0] == model:
            # "TESLA MODEL 3" => [TESLA, MODEL, 3] => "MODEL 3"
            return ' '.join(name.split()[1:])
        return name
    return None
Now we can finally figure out if the car is electric or not. In the first part, I already created a method for that:
electric_cars = {
    "AIWAYS": ['U5', 'U6'],
    "AUDI": ['E-TRON'],
    "BMW": ['I3', 'I4', 'I7', 'IX'],
    "CITROEN": ['E-C4'],
    "FIAT": ['500E', 'ELETTRA'],
    "FORD": ['MACH-E'],
    "HONDA": ['"E"', '"E ADVANCE"'],
    "HYUNDAI": ['IONIQ', 'KONA'],
    "JAGUAR": ['I-PACE'],
    "KIA": ['NIRO', 'E-SOUL'],
    "LEXUS": ['RZ'],
    "LUCID": ['AIR'],
    "MAZDA": ['MX-30'],
    "MERCEDES": ['EQA', 'EQB', 'EQC', 'EQS', 'EQV'],
    "MG": ['ZS EV'],
    "MINI": ['COOPER SE'],
    "NISSAN": ['ALTRA', 'ARIYA', 'EVALIA', 'LEAF', 'NUVU'],
    "OPEL": ['AMPERA-E', 'COMBO-E', 'CORSA-E', 'MOKKA-E', 'VIVARO-E', 'ZAFIRA-E'],
    "PEUGEOT": ['E-208', 'E-2008', 'E-RIFTER', 'E-TRAVELLER'],
    "POLESTAR": ['2', '3'],
    "PORSCHE": ['TAYCAN'],
    "RENAULT": ['MASTER', 'TWINGO', 'KANGOO ELEC', 'ZOE'],
    "SKODA": ['ENYAQ'],
    "SMART": ['EQ'],
    "TESLA": [''],
    "TOYOTA": ['BZ'],
    "VOLKSWAGEN": ['ID.3', 'ID.4', 'ID.5', 'E-GOLF'],
    "VOLVO": ['C40', 'XC40']
}

def check_is_electric(model: str, trade_name: str, cylinders: int):
    """ Determine if the car is electric """
    if isinstance(cylinders, int) and cylinders > 0:
        return False
    for e_model, e_names in electric_cars.items():
        if model == e_model:
            for e_name in e_names:
                if trade_name and (e_name in trade_name or e_name.replace('"', '') == trade_name):
                    return True
                if trade_name is None and len(e_name) == 0:
                    return True
    return False
In this code, I search for specific keywords; for example, if the model is "BMW", then the "I3" trade name tells us that this car is electric. And as an extra check (some cars may be electric or hybrid), I also analyze the number of cylinders, which must be 0 or NULL for electric cars.
All 3 methods were already tested in the first part, and they work well; for example, I can easily apply the model_normalize method to a Pandas dataset using one line of code:
df["Model"] = df['Model'].map(lambda s: model_normalize(s))
But how can we use it with SQL? Well, directly we can't, but we can do this with the help of ORM.
SQLAlchemy ORM
ORM (Object Relational Mapping) is a technology used for creating a "bridge" between an OOP language and relational databases. Practically, we can create a special Python class, and SQLAlchemy will automatically transform all requests to this class into SQL. This is pretty convenient and allows developers to write code in pure Python, without ugly brackets and long SQL strings.
Let's create a "Car" class and put the needed methods there:
from sqlalchemy.orm import DeclarativeBase
from sqlalchemy import String, Integer, Date, Column


class Base(DeclarativeBase):
    pass


class Car(Base):
    __tablename__ = "rdw_cars"

    index = Column("index", Integer, primary_key=True)
    license_plate = Column("License Plate", String)
    model = Column("Model", String)
    trade_name = Column("Trade name", String)
    num_cylinders = Column("Number of Cylinders", Integer)
    first_registration = Column("First registration NL", Date)
    price = Column("Catalog price", Integer)
    is_electric = Column("Is electric", Integer)

    def model_normalize(self):
        """ "PEUGEOT BOXER/GLOBE-TRAVE " => "PEUGEOT" """
        if self.model and isinstance(self.model, str) and len(self.model) > 0:
            self.model = self.model.replace("-", " ").replace("/", " ").split()[0].upper().strip()

    def name_normalize(self):
        """ Remove duplicates from model and trade name: ("TESLA", "TESLA MODEL 3") => ("TESLA", "MODEL 3") """
        if isinstance(self.trade_name, str) and len(self.trade_name) > 0:
            name = self.trade_name.upper().strip()
            if name.split()[0] == self.model:
                # "TESLA MODEL 3" => [TESLA, MODEL, 3] => "MODEL 3"
                self.trade_name = ' '.join(name.split()[1:])
            else:
                self.trade_name = name

    def check_electric(self):
        self.is_electric = check_is_electric(self.model, self.trade_name, self.num_cylinders)
As an example of this approach, let's display the most expensive cars of the specific model in the Netherlands. With a standard SQL, we can make a request like this:
with Session(rdw_db) as session:
    model = "BMW"
    limit = 5
    df = pd.read_sql_query(text(f'SELECT "Model", "Trade name", "Catalog price", "First registration NL" FROM rdw_cars WHERE Model = "{model}" ORDER BY "Catalog price" DESC LIMIT {limit}'),
                           con=session.connection())
    display(df.style.hide(axis="index"))
As a result, we get the table:

It works, but the SQL string is pretty long, and I had to use an f-string to add the needed variables to the request. With the help of Object Relational Mapping, I can use standard Python code instead:
with Session(rdw_db) as session:
    model = "BMW"
    limit = 5
    df = pd.read_sql_query(select(Car.model, Car.trade_name, Car.price, Car.first_registration).filter(Car.model == model).order_by(Car.price.desc()).limit(limit),
                           con=session.connection())
    display(df.style.hide(axis="index"))
SQLAlchemy will create the proper SQL request "under the hood", and as a result, we get easier-to-read Python code. The major drawback, though, is that SQL is more or less a standard; there are many resources and tutorials about it, while the SQLAlchemy code is specific to this library. But for our task, this approach works well.
Using ORM, we can easily apply our "check_electric" method to all the records in the database:
with Session(rdw_db) as session:
    cars_total = session.query(Car).count()
    index = 0
    batch_size = 25000
    while True:
        pos1, pos2 = index*batch_size, (index + 1)*batch_size
        if index % 20 == 0:
            print(f"Processing {pos1} to {pos2}, {100*index*batch_size//cars_total}%...")
        cars = session.query(Car).filter(Car.index.between(pos1, pos2)).all()
        if len(cars) == 0:
            break
        for car in cars:
            car.model_normalize()
            car.name_normalize()
            car.check_electric()
        session.flush()
        index += 1
    session.commit()
In this code, I read records from the database, update their properties, and save the data back to the table. SQLAlchemy updates the table using SQL requests, and it is definitely slower than updating a Pandas dataframe directly in memory: calling the same methods in Pandas took only 130 seconds, while the SQLAlchemy version took 390 seconds, about 3 times slower. On the other hand, the batch update needs much less memory, and there is no need to keep the whole dataframe in RAM.
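One possible optimization (a sketch of my own, not from the measured run above): cars with cylinders can never be electric, so the batch loop only really needs the rows where "Number of Cylinders" is NULL or 0; the rest can keep the default "Is electric" = 0 without a Python round trip. An in-memory demo of the idea on three made-up rows:

```python
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

# In-memory SQLite for the demo; with the real data, reuse the rdw_db engine
engine = create_engine("sqlite://")
with Session(engine) as session:
    session.execute(text('CREATE TABLE rdw_cars("index" INTEGER PRIMARY KEY, '
                         '"Number of Cylinders" INTEGER, "Is electric" INTEGER DEFAULT 0)'))
    session.execute(text('INSERT INTO rdw_cars("Number of Cylinders") VALUES (4), (NULL), (0)'))
    # Only cylinder-less rows need the per-row Python model-name check;
    # the 4-cylinder row can keep the default "Is electric" = 0
    candidates = session.execute(text(
        'SELECT COUNT(*) FROM rdw_cars '
        'WHERE "Number of Cylinders" IS NULL OR "Number of Cylinders" = 0')).scalar()
    print(candidates)  # 2 of the 3 rows need the Python check
```

On the real dataset, the same filter could be added to the batched `session.query(Car)` to shrink the number of objects the loop touches.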
Analysis
After updating the table, we're finally ready to go. As a warm-up, let's calculate the mean and percentiles of the car prices.
Calculating the mean is easy and can be done with SQLAlchemy in one line of code. Let's get the total amount of cars and the arithmetic mean of their prices:
with Session(rdw_db) as session:
    c_total = session.query(Car).count()
    print(f"Cars total: {c_total}")

    c_el = session.query(Car).filter(Car.is_electric == 1).count()
    print(f"Cars electric: {c_el} ({100*c_el/c_total:.2f}%)")

    pm = session.query(func.avg(Car.price)).scalar()
    print("Price mean:", pm)

    pm_el = session.query(func.avg(Car.price)).filter(Car.is_electric == 1).scalar()
    print("Electric cars price mean:", pm_el)
Getting the percentiles is a bit trickier; we have two ways of doing that. We can load only the "price" column and then use the NumPy "percentile" method to do the math:
prices = session.query(Car.price).filter(Car.price != None).all()
print("All cars percentiles [5, 50, 95]:", np.percentile(prices, [5, 50, 95]))
prices_el = session.query(Car.price).filter((Car.price != None) & (Car.is_electric == 1)).all()
print("Electric cars percentiles [5, 50, 95]:", np.percentile(prices_el, [5, 50, 95]))
If the dataset is large and we want to avoid loading the data at all, we can get the percentiles in pure SQL by combining "order_by", "limit", and "offset":
num_total = session.query(Car).filter(Car.price != None).count()
p5 = session.query(Car.price).filter(Car.price != None).order_by(Car.price).offset(num_total*5/100 - 1).limit(1).scalar()
p50 = session.query(Car.price).filter(Car.price != None).order_by(Car.price).offset(num_total*50/100 - 1).limit(1).scalar()
p95 = session.query(Car.price).filter(Car.price != None).order_by(Car.price).offset(num_total*95/100 - 1).limit(1).scalar()
print("All cars percentiles [5, 50, 95]:", p5, p50, p95)
num_el = session.query(Car).filter((Car.price != None) & (Car.is_electric == 1)).count()
p5 = session.query(Car.price).filter((Car.price != None) & (Car.is_electric == 1)).order_by(Car.price).offset(num_el*5/100 - 1).limit(1).scalar()
p50 = session.query(Car.price).filter((Car.price != None) & (Car.is_electric == 1)).order_by(Car.price).offset(num_el*50/100 - 1).limit(1).scalar()
p95 = session.query(Car.price).filter((Car.price != None) & (Car.is_electric == 1)).order_by(Car.price).offset(num_el*95/100 - 1).limit(1).scalar()
print("Electric cars percentiles [5, 50, 95]:", p5, p50, p95)
The results are interesting:

The Netherlands is a country with pretty high average salaries, but at the time of writing this article, only 2.93% of cars are electric. The median price for all cars is €26,341, and electric cars are "on average" 2x more expensive; their median price is €49,975. The 95th percentile for all cars is €73,381, which means that 95% of cars have a lower price. At the same time, 95% of electric cars have a price lower than €106,989.
Let's now find something more fun. Let's get the Top-20 electric cars in the Netherlands:
from bokeh.plotting import figure, show
from bokeh.palettes import Viridis256
from sqlalchemy import desc

with Session(rdw_db) as session:
    n_top = 20
    # Group by car model
    models_amount = session.query(Car.model, func.count(Car.model)).filter(Car.is_electric == 1).group_by(Car.model).order_by(desc(func.count(Car.model))).limit(n_top).all()[::-1]
    # Unzip the array [('TESLA', 65896), ('VOLKSWAGEN', 28559), ...] into 2 parts
    models, amount = zip(*models_amount)
    # Show
    p = figure(y_range=models, width=1200, height=500, title="Top-%d electric car manufacturers in the Netherlands (data 2023)" % n_top)
    p.hbar(right=amount, y=models, height=0.8, color=Viridis256[:n_top])
    p.xgrid.grid_line_color = None
    p.x_range.start = 0
    p.below[0].formatter.use_scientific = False
    p.xaxis.axis_label = "Cars total"
    show(p)
As we can see, Tesla takes first place, with more than 55,000 cars registered in the country:

I was curious which Tesla model is the most popular. To find out, we can change the request:
models_amount = session.query(Car.trade_name, func.count(Car.trade_name)).filter(Car.model == "TESLA").group_by(Car.trade_name).order_by(desc(func.count(Car.trade_name))).order_by(Car.trade_name).all()[::-1]
...
It is clear that the "Model 3" is the most popular electric car at the moment of writing this article:

But we can also see that the dataset obviously needs more cleaning: some Tesla cars were registered as "MODEL 3" and some as "MODEL3"; some cars were saved as "ROADSTER" and some as "RAODSTER", and so on.
Let's now group the electric car registrations per date. To make the graph clearer, I want to group the dates by quarter, but the code for extracting quarters in SQL can be bulky. Instead, I will group registrations per day using SQL, and then calculate the quarters using the internal Pandas function:
import datetime

with Session(rdw_db) as session:
    regs_amount = session.query(Car.first_registration, func.count(Car.first_registration)).filter(Car.is_electric == 1).group_by(Car.first_registration).order_by(Car.first_registration).all()

df = pd.DataFrame(regs_amount, columns=['First registration NL', 'Amount'])
df["First registration NL"] = df['First registration NL'].map(lambda d: datetime.datetime(d.year, d.month, d.day))
df["Quarter"] = df['First registration NL'].dt.to_period('Q')
data_per_quarter = df.groupby(['Quarter'], as_index=False)["Amount"].sum()

dates = data_per_quarter['Quarter']
amount = data_per_quarter['Amount']

p = figure(x_axis_type='datetime', width=1600, height=500,
           title="Electric car registrations in the Netherlands, 1992-2022")
p.vbar(x=dates, top=amount, width=datetime.timedelta(days=3*22), line_color='black')
p.xaxis[0].ticker.desired_num_ticks = 20
p.yaxis.axis_label = "Cars total"
show(p)
In this code, I first convert the SQL result to a Pandas dataframe, then convert the Python "date" objects to "datetime" (for some reason, the quarter calculation works with "datetime" only). The code is almost the same as in the first part, but here I use the Pandas "groupby.sum()" instead of "size()", because the data retrieved from SQL is already grouped by day.
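The quarter conversion itself is easy to check on a toy dataframe (the dates and amounts here are made up, just to show the mechanics):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "First registration NL": pd.to_datetime(["2022-01-15", "2022-02-20", "2022-07-01"]),
    "Amount": [2, 3, 5],
})
# Days collapse into quarters: 2022Q1 gets 2 + 3 = 5 registrations, 2022Q3 gets 5
df_demo["Quarter"] = df_demo["First registration NL"].dt.to_period("Q")
per_quarter = df_demo.groupby("Quarter", as_index=False)["Amount"].sum()
print(per_quarter["Amount"].tolist())  # [5, 5]
```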
The result is interesting:

As was described in the first part, the first electric car was registered in the Netherlands in 1992. It was a Fiat Panda Elettra, a small two-seat car with a 70 km/h maximum speed, 100 km range, and 12 lead-acid 6V batteries as a power source. It was the only electric car in the whole country for 15 years; the 3 next Tesla Roadster cars were registered only in 2009.
Let's now get the price distribution of electric cars. I want to draw a box-and-whisker plot, and for that, I need to know the minimum, maximum, and quartile values per model:
with Session(rdw_db) as session:
    request_models = session.query(Car.model).filter(Car.is_electric == 1).group_by(Car.model).all()

    def q0(x):
        return x.quantile(0.01)

    def q1(x):
        return x.quantile(0.25)

    def q3(x):
        return x.quantile(0.75)

    def q4(x):
        return x.quantile(0.99)

    models_data = {}
    for m in request_models:
        model_name = m[0]  # (AIWAYS,) => AIWAYS
        print("Processing", model_name)
        request_model = session.query(Car.price).filter((Car.is_electric == 1) & (Car.price > 0) & (Car.model == model_name)).all()
        df = pd.DataFrame(request_model)
        agg_data = {'price': ['size', 'min', q0, q1, 'median', q3, q4, 'max']}
        models_data[model_name] = df.agg(agg_data)["price"]

df = pd.concat(models_data, axis=1).transpose()
display(df)
In this code, I first get the list of all car models; then I get the prices for each model and aggregate them using Pandas. Finally, I combine the data into a single dataframe. The result looks like this:

Having this dataframe, it is easy to draw a box plot:
from bokeh.models import ColumnDataSource, Whisker
from bokeh.palettes import Inferno10, Magma10, Plasma10, Viridis10

# Sort models by price
df = df.sort_values(by='median', ascending=True)
models = df.index.values
v_min = df["q0"].values
q1 = df["q1"].values
q3 = df["q3"].values
v_max = df["q4"].values

# Draw
palette = (Inferno10 + Magma10 + Plasma10 + Viridis10)[:models.shape[0]]
source = ColumnDataSource(data=dict(models=models,
                                    bottom=q1,
                                    top=q3,
                                    color=palette,
                                    lower=v_min,
                                    upper=v_max))
p = figure(x_range=models, width=1900, height=500, title="Electric car prices distribution in the Netherlands")
whisker = Whisker(base="models", upper="upper", lower="lower", source=source)
p.add_layout(whisker)
p.vbar(x='models', top='top', bottom='bottom', width=0.9, color='color', line_color="black", source=source)
p.left[0].formatter.use_scientific = False
p.y_range.start = 0
show(p)
The result looks like this:

With the help of SQLAlchemy, it is also easy to get all the prices of electric cars and build a histogram using the "np.histogram" method. The code is almost the same as in the first part; readers who wish can do it on their own.
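For completeness, here is one possible sketch of that exercise. The query line is commented out and assumes the Car class and session from above; the price list below is made up purely for the demo:

```python
import numpy as np

# With the real database it would be something like:
# prices = [p[0] for p in session.query(Car.price).filter(
#     (Car.price != None) & (Car.is_electric == 1)).all()]
prices = [30000, 45000, 52000, 61000, 104000]  # hypothetical demo values

counts, bin_edges = np.histogram(prices, bins=5, range=(0, 150000))
print(counts)     # number of cars in each price bin
print(bin_edges)  # bin boundaries in euros
```

The `counts` array can then be passed to a Bokeh `vbar` the same way as the quarterly registrations above.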
Conclusion
Analyzing data from a real dataset is interesting, and it turned out that SQL and Pandas work together pretty well. The "heavy lifting" of retrieving and pre-processing the data can be done using SQLAlchemy, and then this data can be used in Pandas.
As for processing the data itself, a lot more can be done. It may be interesting to combine this data with the Kaggle Electric Vehicles dataset and find the correlation between maximum range, price, and car release date (newer models should have a longer range). I tried to do this, but the car model names in the two datasets do not match, and I am not enough of a car expert to fix this manually for every model. Also, as was shown before, the RDW dataset needs more cleaning, and not all the names are consistent. Readers who are interested can continue these experiments on their own.
If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors.
Thanks for reading.