Electric Cars In the Netherlands: Exploratory Data Analysis with Python

Author:Murphy  |  View: 25237  |  Time: 2025-03-23 19:48:43
Smart EQ Car, Image Source https://en.wikipedia.org/wiki/Smart_electric_drive

When was the first electric car registered? (Spoiler: it was much earlier than most people may think.) Which Cars are more expensive, the electric Porcshe or Jaguar? Exploratory data analysis (EDA) is not only an important part of building every data pipeline, but it is also a pretty interesting process. In this article, I will use the Dutch RDW (Netherlands Vehicle Authority) public dataset to find information about electric cars. We will see which data can be extracted and displayed with Python, Pandas, and Bokeh.

Let's get started.

Loading the data

The RDW ("Rijks Dienst Wegverkeer", https://www.rdw.nl) is a Dutch organization that handles approvals and registration of motorized vehicles and driving licenses in the Netherlands. As a public governmental institution, it has its data available to everyone. Most interesting for us is the "Gekentekende voertuigen" ("Vehicles with license plates") dataset. It is available for free under a Public Domain license and can be downloaded from opendata.rdw.nl. The file size is about 10 GB; it contains information about all vehicles, registered in the Netherlands since 1952. Processing a file of such a size can also be a challenge – which makes the task more interesting.

I will use Jupyter Lab, in this case, it is more convenient than using a standard IDE because reloading the 10 GB file every time a project is starting not looks like a good idea. Also, I will use Pandas for processing and Bokeh for visualization. First, let's import the needed libraries:

import os
import pandas as pd
import numpy as np
import datetime

from bokeh.io import show, output_notebook, export_png
from bokeh.plotting import figure, output_file
from bokeh.models import ColumnDataSource, LabelSet, Whisker
from bokeh.palettes import *
output_notebook()

Now we are ready to load the dataset. Let's try a "naive" approach first:

filename = "Open_Data_RDW__Gekentekende_voertuigen.csv"
df = pd.read_csv(filename)
display(df)

After running this code, the PC is freezing for about 30 seconds… and the Python kernel is crushing. Oops. It's not only loading slowly, but we also don't have enough memory. At least on my computer, 32 GB of RAM was not enough for that task.

Well, if we cannot load the file in memory, we can read it line by line; this approach is well known since the time of IBM mainframes and tape drives. Let's read the first lines of the file and see what is inside:

filename = "Open_Data_RDW__Gekentekende_voertuigen.csv"
with open(filename, 'r') as f:
    header_str = f.readline()
    print(header_str)
    for _ in range(10):
        print(f.readline())

The result looks like this:

Kenteken,Voertuigsoort,Merk,Handelsbenaming,Vervaldatum APK,Datum tenaamstelling,Bruto BPM,Inrichting,Aantal zitplaatsen,Eerste kleur,Tweede kleur,Aantal cilinders,Cilinderinhoud,Massa ledig voertuig,Toegestane maximum massa voertuig,Massa rijklaar,Maximum massa trekken ongeremd,Maximum trekken massa geremd,Datum eerste toelating,Datum eerste tenaamstelling in Nederland,Wacht op keuren,Catalogusprijs,WAM verzekerd,Maximale constructiesnelheid,Laadvermogen,Oplegger geremd,Aanhangwagen autonoom geremd,Aanhangwagen middenas geremd,Aantal staanplaatsen,Aantal deuren,Aantal wielen,Afstand hart koppeling tot achterzijde voertuig,Afstand voorzijde voertuig tot hart koppeling,Afwijkende maximum snelheid,Lengte,Breedte,Europese voertuigcategorie,Europese voertuigcategorie toevoeging,Europese uitvoeringcategorie toevoeging,Plaats chassisnummer,Technische max. massa voertuig,Type,Type gasinstallatie,Typegoedkeuringsnummer,Variant,Uitvoering,Volgnummer wijziging EU typegoedkeuring,Vermogen massarijklaar,Wielbasis,Export indicator,Openstaande terugroepactie indicator,Vervaldatum tachograaf,Taxi indicator,Maximum massa samenstelling,Aantal rolstoelplaatsen,Maximum ondersteunende snelheid,Jaar laatste registratie tellerstand,Tellerstandoordeel,Code toelichting tellerstandoordeel,Tenaamstellen mogelijk,Vervaldatum APK DT,Datum tenaamstelling DT,Datum eerste toelating DT,Datum eerste tenaamstelling in Nederland DT,Vervaldatum tachograaf DT,Maximum last onder de vooras(sen) (tezamen)/koppeling,Type remsysteem voertuig code,Rupsonderstelconfiguratiecode,Wielbasis voertuig minimum,Wielbasis voertuig maximum,Lengte voertuig minimum,Lengte voertuig maximum,Breedte voertuig minimum,Breedte voertuig maximum,Hoogte voertuig,Hoogte voertuig minimum,Hoogte voertuig maximum,Massa bedrijfsklaar minimaal,Massa bedrijfsklaar maximaal,Technisch toelaatbaar massa koppelpunt,Maximum massa technisch maximaal,Maximum massa technisch minimaal,Subcategorie Nederland,Verticale belasting koppelpunt getrokken voertuig,Zuinigheidsclassificatie,Registratie datum goedkeuring (afschrijvingsmoment BPM),Registratie datum goedkeuring (afschrijvingsmoment BPM) DT,API Gekentekende_voertuigen_assen,API Gekentekende_voertuigen_brandstof,API Gekentekende_voertuigen_carrosserie,API Gekentekende_voertuigen_carrosserie_specifiek,API Gekentekende_voertuigen_voertuigklasse
85XXXA,Personenauto,VOLKSWAGEN,CALIFORNIA,20230702,20220915,10437,kampeerwagen,,GROEN,Niet geregistreerd,5,2461,2088,2800,2188,700,2000,20010626,20010626,Geen verstrekking in Open Data,,Ja,,,,,,,0,4,0,0,,0,0,M1,,,r. in watergoot v. voorruit,2800,,,e1*96/79*0066*10,AJTCKX0,N1P00J2SGFM52B010U,1,0.03,292,Nee,Nee,,Nee,4500,0,0.00,2022,Logisch,00,Ja,07/02/2023 12:00:00 AM,09/15/2022 12:00:00 AM,06/26/2001 12:00:00 AM,06/26/2001 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,https://opendata.rdw.nl/resource/3huj-srit.json,https://opendata.rdw.nl/resource/8ys7-d773.json,https://opendata.rdw.nl/resource/vezc-m2t6.json,https://opendata.rdw.nl/resource/jhie-znh9.json,https://opendata.rdw.nl/resource/kmfi-hrps.json
85XXXB,Personenauto,PEUGEOT,3*RFN*,20230920,20210224,5162,hatchback,5,ZWART,Niet geregistreerd,4,1997,1194,1719,1294,625,1300,20010720,20010720,Geen verstrekking in Open Data,,Ja,,,,,,,4,4,0,0,,420,0,M1,,,op r. schroefveerkoker onder motorkap,1719,,,e2*98/14*0244*00,C,B,0,0.08,261,Nee,Nee,,Nee,3019,0,,2022,Logisch,00,Ja,09/20/2023 12:00:00 AM,02/24/2021 12:00:00 AM,07/20/2001 12:00:00 AM,07/20/2001 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,D,,,https://opendata.rdw.nl/resource/3huj-srit.json,https://opendata.rdw.nl/resource/8ys7-d773.json,https://opendata.rdw.nl/resource/vezc-m2t6.json,https://opendata.rdw.nl/resource/jhie-znh9.json,https://opendata.rdw.nl/resource/kmfi-hrps.json
...
85XXXN,Personenauto,NISSAN,NISSAN MURANO,20240106,20111126,18921,stationwagen,5,ZWART,Niet geregistreerd,6,3498,1833,2380,1933,750,1585,20081206,20081206,Geen verstrekking in Open Data,,Ja,,,,,,,4,4,0,0,,484,0,M1,,,r. voorzitting by dwarsbalk,2380,Z51,,e1*2001/116*0478*00,A,A01,0,0.1,283,Nee,Nee,,Nee,3965,0,,2023,Logisch,00,Ja,01/06/2024 12:00:00 AM,11/26/2011 12:00:00 AM,12/06/2008 12:00:00 AM,12/06/2008 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,E,,,https://opendata.rdw.nl/resource/3huj-srit.json,https://opendata.rdw.nl/resource/8ys7-d773.json,https://opendata.rdw.nl/resource/vezc-m2t6.json,https://opendata.rdw.nl/resource/jhie-znh9.json,https://opendata.rdw.nl/resource/kmfi-hrps.json

As we can see, there are plenty of different data fields, and we actually don't need all of them. About every car, I want to know only its type, license plate, model name, price, and registration date. This database is old enough, and there is no field indicating if the car is electric or not. But at least, there is a field, containing the "Number of cylinders", which can help us to exclude the cars which are not electric.

Now we have only 7 fields to load, and in Pandas, we can specify the columns list, which drastically reduces the data size. The second trick is to specify the pd.UInt32Dtype to columns ‘Number of cylinders' and ‘Price'. I also want to see only "personal" cars ("Personenauto" in Dutch), and not trucks or buses:

cols = ['Kenteken', 'Voertuigsoort', 'Merk', 'Handelsbenaming', 'Aantal cilinders', 'Catalogusprijs']
cols_date = ['Datum eerste tenaamstelling in Nederland']

filename = "Open_Data_RDW__Gekentekende_voertuigen.csv"
df = pd.read_csv(filename, usecols=cols + cols_date, parse_dates=cols_date, 
                 dtype={"Catalogusprijs": pd.UInt32Dtype(), 
                        "Aantal cilinders": pd.UInt32Dtype()})
display(df)

df = df[df['Voertuigsoort'] == 'Personenauto']
df.info(memory_usage="deep")

Now the file was loaded correctly, and as the "info" method shows, the memory usage is 2.5 GB:

Dataset information, Image by author

But because of the large file size, data loading still takes a long time. The easiest way is to save the filtered dataset as a new file, and to use this file for further experiments:

df.to_csv("Open_Data_RDW__Gekentekende_voertuigen_short.csv", sep=',', 
          encoding='utf-8')

This file has only a 580 KB size, which is much smaller than the original 10 GB, and its loading does not cause any problems.

We also don't need a "Voertuigsoort" field anymore, and dropping this column will release some RAM and screen space. And as a last step, let's translate data fields to English from Dutch – it is not mandatory for analysis but will be more convenient to readers:

df = df.drop('Voertuigsoort', axis=1)

translations_dict_en_nl = {
                           'Kenteken': 'License plate', 
                           'Merk': 'Model', 
                           'Handelsbenaming': 'Trade name', 
                           'Aantal cilinders': 'Number of Cylinders', 
                           'Catalogusprijs': 'Catalog price',
                           'Datum eerste tenaamstelling in Nederland': 'First registration NL', 
                          }

df.rename(translations_dict_en_nl, axis='columns', inplace=True)

And now we are ready to go.

Basic analysis

In the beginning, let's see the main properties of the dataset, like data samples and dimensionality:

display(df)

display(df.shape[0])

display(df.isna().sum())

The display(df) method shows us the first and last rows of the dataset, so we can see what the data looks like. The second line shows us the total amount of records, which can be useful for calculations, and the last request will return the number of Null values per column.

The output looks like this:

Dataframe properties, Image by author

We have 9,487,265 records, where each car has a license plate, model, and registration date (these fields are probably mandatory for the registration), but the other fields, like "Trade name" or "Catalog price" are missing for some cars. Technically, we don't need any cleaning right now, but for some requests (like price distribution) we should remove Null values before making the request.

As an example of this approach, let's sort the data, to see the most expensive and the cheapest cars in the Netherlands:

df[df['Catalog price'].notna()].sort_values(by=['Catalog price'], 
                                            ascending=False)
Dataframe sorted by price, Image by author

The result is interesting. In first place is "PEUGEOT 5008" with a €9,700,305 price, which is weird because its price in Google is about €41,000 – maybe it is an error in the database, or the owner spent a lot of money for upgrades

Tags: Cars Data Science Data Visualization Hands On Tutorials Programming

Comment