Housing Rental Market in Germany: Exploratory Data Analysis with Python

Author: Murphy
Salzbrücke, Germany, Image source https://en.wikipedia.org/wiki/German_Timber-Frame_Road

Germany is not only the largest economy in Europe, but it is also a country with beautiful landscapes and an interesting culture. Unsurprisingly, Germany is a popular destination for tourists and expats from all over the world. Exploratory data analysis of the German housing rental market can be interesting not only for data analysts but also for people who are going to live and work in this country. I will show some interesting trends that can be found with Python, Pandas, and Bokeh.

Let's get into it.

Collecting The Data

To find the data, I decided to use ImmoScout24, which is not only one of the biggest websites of this type (about 72K apartments and houses were listed there at the time of writing this article) but also the oldest. According to https://web.archive.org, its first version appeared in 1999, more than 20 years ago. ImmoScout24 also has an API and a page for developers. I contacted the PR department, and they gave me permission to use the website data for this publication, but they were not able to give me an API key. This API is probably intended only for partners, for adding or editing housing data rather than for batch reading. Well, it is not a problem; the data can be retrieved from the web pages using Python, which makes the task even more challenging.

Before collecting the data in a similar way, please ask for the owner's permission first and also be "a good house guest": don't use too many threads to prevent server overload, use locally saved HTML files to debug your code, and in the case of using a web browser, disable image loading whenever possible.
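
As a small illustration of the last point, a helper like this one (my own sketch, not part of the original code) caches each downloaded page in a local file, so the parsing logic can be debugged without hitting the server again:

import os
from typing import Callable

def get_html_cached(url: str, cache_file: str, fetch: Callable[[str], str]) -> str:
    """ Return page HTML from a local cache file if present;
        otherwise download it with the given fetch function and save it """
    if os.path.exists(cache_file):
        with open(cache_file, "r", encoding="utf-8") as f_in:
            return f_in.read()
    html = fetch(url)
    with open(cache_file, "w", encoding="utf-8") as f_out:
        f_out.write(html)
    return html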

First, I tried to get the page data with requests:

import requests

url_berlin = "https://www...."
print(requests.get(url_berlin))

Alas, it did not work – the page has protection against robots, and before getting the search results, the person must confirm that he or she is not a robot. Simple ways, like changing the "user-agent", did not help. Well, we are indeed not robots, and it is not a problem. The Selenium Python library allows using a real Chrome browser to retrieve the data and automate reading pages:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

def page_has_loaded(driver: webdriver.Chrome):
    """ Check if the page is ready """
    page_state = driver.execute_script('return document.readyState;')
    return page_state == 'complete'

def page_get(url: str, driver: webdriver.Chrome, delay_sec: int):
    """ Get the page content """
    driver.get(url)
    time.sleep(delay_sec)
    while not page_has_loaded(driver):
        time.sleep(1)
    return driver.page_source

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=Service("./chromedriver"), options=options)

# Get the first page
url_page1 = "https://www...."
html1 = page_get(url_page1, driver, delay_sec=30)

# Get next pages
url_page2 = "https://www..."
html2 = page_get(url_page2, driver, delay_sec=1)
...

When we run the code, the browser window will be opened. And as we can see in the code, before processing the first page, I added a 30-second delay, which is enough to confirm that I'm not a robot. During this interval, it is also good to open the browser "settings" by pressing the 3 dots at the right and to disable loading images; this makes processing much faster. The browser remains open during requests for the next pages, and further data can be processed without this "robot" check.
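
Image loading can also be switched off directly from code when the driver is created, by passing a Chrome preference; this is a small sketch based on a commonly used preference value, not part of the original script:

options = webdriver.ChromeOptions()
# Ask Chrome not to load images (value 2 means "block"); this speeds up page processing
options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)
driver = webdriver.Chrome(service=Service("./chromedriver"), options=options)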

After we get the HTML body, the data extraction is more or less straightforward. First, we have to find the properties of the HTML elements by using the "Inspect" button in the web browser:

HTML output in the browser, Image by author

Then we can get these elements in Python by using the Beautiful Soup library. This code extracts all apartment URLs from the page:

from bs4 import BeautifulSoup

# base_url is the site root (elided here, as elsewhere in this article)
base_url = "https://www..."

soup = BeautifulSoup(html1, "lxml")
li = soup.find(id="resultListItems")
links_all = []
children = li.find_all("li", {"class": "result-list__listing"})
for child in children:
    for link in child.find_all("a"):
        if 'data-go-to-expose-id' in link.attrs:
            links_all.append(base_url + link['href'])
            break
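
To process the whole search result, the same extraction can be wrapped into a function and applied to every result page in a loop. The page URL pattern below is only a placeholder (the real search URLs are elided in this article), and the number of pages is assumed to be known, so treat this as a sketch of the overall flow:

def extract_links(html: str) -> list:
    """ Extract all apartment URLs from one search results page """
    soup = BeautifulSoup(html, "lxml")
    result_list = soup.find(id="resultListItems")
    links = []
    for item in result_list.find_all("li", {"class": "result-list__listing"}):
        for link in item.find_all("a"):
            if 'data-go-to-expose-id' in link.attrs:
                links.append(base_url + link['href'])
                break
    return links

num_pages = 50  # illustrative value; in practice it can be read from the pagination element
links_all = []
for page_num in range(1, num_pages + 1):
    # Placeholder URL pattern; the real search URL is elided in this article
    url = f"https://www....?pagenumber={page_num}"
    # A long delay only on the first page, to pass the "robot" check manually
    html = page_get(url, driver, delay_sec=30 if page_num == 1 else 1)
    links_all.extend(extract_links(html))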

Let's now find out what kind of data we can get.

Data Fields

For every property, we can get a page like this (for privacy reasons, all values and company names are blurred):

A page example, Image by author

Let's see what kind of data we can get:

  • Title. In this picture, we can see (in German, of course) "A beautiful single-apartment in a beautiful (place) Hermsdorf". I don't think this text is useful for analysis, but just for fun, we will build a word cloud from it later.
  • Typ (type). For this example, the type is "Etagenwohnung" (apartment, located on the floor).
  • Kaltmiete is a so-called "cold price". This is a rent price without utility costs, such as heating or electricity.
  • Warmmiete, or a "warm price". The name can be a bit misleading: as we can see in the picture, the "warm price" contains not only heating costs ("Heizkosten") but also other extra costs ("Nebenkosten").
  • Etage (floor). On this page we can see the text "0 of 3", so a small amount of parsing will be required. In Germany, the 1st floor is the first elevated floor, so I suppose that 0 means "ground floor", or "Erdgeschoss" in German. From the same text we can also extract the total number of floors in the building.
  • Kaution (deposit). A value that covers possible damage and is returned to the tenant at the end of the tenancy. Here we can see a value of "3-Kaltmieten" (three cold rents), so again some parsing will be required; see the sketch after this list.
  • Fläche (area). As the name suggests, it's the area of the house or apartment.
  • Zimmer (room). In this example, it's 1.
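
As noted for the Etage and Kaution fields, a little text parsing is needed. Below is a minimal sketch of how such strings could be turned into numbers; the exact label formats ("0 von 3" for the floor, "3-Kaltmieten" for the deposit) are assumptions based on the examples above:

import re

def parse_floor(text: str):
    """ Parse strings like '0 von 3' into (floor, floors_in_building) """
    match = re.match(r"\s*(\d+)\s*von\s*(\d+)", text)
    if match:
        return int(match.group(1)), int(match.group(2))
    return None, None

def parse_deposit(text: str, price_cold: int):
    """ Parse deposits given either as 'N-Kaltmieten' or as a plain amount """
    match = re.match(r"\s*(\d+)\s*-?\s*Kaltmiete", text)
    if match:
        return int(match.group(1)) * price_cold
    digits = re.sub(r"[^\d]", "", text)  # crude: '1.500 EUR' -> '1500'
    return int(digits) if digits else None

print(parse_floor("0 von 3"))              # (0, 3)
print(parse_deposit("3-Kaltmieten", 650))  # 1950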

Other data fields can also be extracted from the page, like extra rent for a garage or allowance for having pets, but for our task, these fields should be enough.

The HTML parsing process is, in general, the same as described before. For example, to get the property title, this code can be used:

soup = BeautifulSoup(s_html, "lxml")

title = soup.find_all("h1", id="expose-title")
if len(title) > 0:
    str_title = title[0].get_text().strip()
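
For numeric fields there is one extra step: converting the German number format, where a dot is the thousands separator and a comma is the decimal separator. The class name in this sketch is hypothetical; the real selectors have to be looked up with the browser "Inspect" tool, as shown above:

import re

def parse_german_number(text: str):
    """ Convert strings like '1.234,56' to a float """
    digits = re.sub(r"[^\d,.]", "", text)                # keep digits and separators
    digits = digits.replace(".", "").replace(",", ".")   # '1.234,56' -> '1234.56'
    return float(digits) if digits else None

# Hypothetical class name, for illustration only
price_tag = soup.find("dd", {"class": "cold-rent-value"})
if price_tag is not None:
    price_cold = parse_german_number(price_tag.get_text())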

All other fields can be found in the same way. After running the code for all pages, I got a dataset like this, which I saved in CSV format:

property_id;logging_date;property_area;num_rooms;floor;floors_in_building;price_cold_eur;price_warm_eur;deposit_eur;property_type;publisher;city;title;address;region;
13507XXX1;2023-03-20;7.0;1;None;None;110;110;None;Sonstige;Private;Berlin;Lagerraum / Kellerraum / Abstellraum zu vermieten;None;Moabit, 10551 Berlin;
13613XXX2;2023-03-20;29.0;1;None;None;189;320;None;None;XXXXXXXX Sverige AB;Berlin;Wohnungstausch: Luise-Zietz-Straße 119;Luise-Zietz-Straße 119;Marzahn, 12000 Berlin;
...
14010XXXn;2023-03-20;68.0;1;None;None;28000;28000;1000;None;HousingXXXXXXXXX B.V;Berlin;Wilhelminenhofstraße, Berlin;Wilhelminenhofstraße 0;Oberschöneweide, 12459 Berlin;
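
For completeness, this is roughly how such a file can be written with the standard csv module, assuming the parsed values are collected into dictionaries (a sketch, with the field list matching the columns shown above):

import csv

fields = ["property_id", "logging_date", "property_area", "num_rooms",
          "floor", "floors_in_building", "price_cold_eur", "price_warm_eur",
          "deposit_eur", "property_type", "publisher", "city",
          "title", "address", "region"]

with open("Berlin.csv", "w", newline="", encoding="utf-8") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=fields, delimiter=";")
    writer.writeheader()
    for record in records:  # 'records' is a list of dicts produced by the parser (assumed)
        # Missing values are written as the string "None"; they will be converted
        # back to NA when loading the file with Pandas (see the next section)
        writer.writerow({key: ("None" if record.get(key) is None else record[key])
                         for key in fields})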

Let's now see what information we can get.

Data Transform and Load

As we saw in the previous section, the housing data will definitely require some cleaning and transformation.

I collected the data from 6 cities located in different parts of Germany: Berlin, Dresden, Frankfurt, Hamburg, Köln, and München. As an example, let's check Berlin; for other cities, the approach is the same. First, let's load the CSV into the Pandas dataframe:

import pandas as pd

df_berlin = pd.read_csv("Berlin.csv", sep=';', 
                        na_values=["None"], parse_dates=['logging_date'], 
                        dtype={"price_cold_eur": pd.Int32Dtype(), 
                               "price_warm_eur": pd.Int32Dtype(), 
                               "floor": pd.Int32Dtype(), 
                               "floors_in_building": pd.Int32Dtype()})
display(df_berlin)

The process is straightforward, but there are some useful tricks here. First, the parsing was done in Python, and for missing values, the string "None" was written to the CSV. I don't want to keep "None" as a text string, so I listed it in the "na_values" parameter. I also specified ";" as a separator and set the "pd.Int32Dtype" type for integer fields like the price or floor number. By the way, my first attempt was to use UInt32, because a price cannot be negative anyway, but it turned out that when calculating differences, negative values can sometimes occur, and this caused some cells to get values like 4,294,967,295. Practically, it was just easier to keep Int32; luckily for us, housing prices are not higher than the maximum Int32 value.
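
The wrap-around effect behind those huge numbers is easy to reproduce; here is a tiny illustration with NumPy (not from the original notebook), showing what happens when a negative difference is stored in an unsigned 32-bit integer:

import numpy as np

a = np.array([500], dtype=np.uint32)
b = np.array([700], dtype=np.uint32)
# 500 - 700 cannot be represented as uint32, so the result wraps around
print(a - b)  # [4294967096]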
