My (Very) Personal Data Warehouse

Author: Murphy  |  Views: 25616  |  Time: 2025-03-23 18:27:33
Photo by Jake Hills on Unsplash

Wearable fitness trackers have become an integral part of our lives, collecting and tracking data about our daily activities, sleep patterns, location, heart rate, and much more. I've been using a Fitbit device for six years to monitor my health. However, I have always found the data analysis capabilities lacking – especially when I wanted to track my progress against long-term fitness goals. What insights are buried within my archive of personal fitness activity data? To start exploring, I needed a good approach for performing data analysis over thousands of poorly documented JSON and CSV files … extra points for analysis that doesn't require my data to leave my laptop.

Enter DuckDB – a lightweight, free yet powerful analytical database designed to streamline data analysis workflows – that runs locally. In this blog post, I want to use DuckDB to explore my Fitbit data archive and share an approach for analysing a variety of data formats and charting my health and fitness goals with the help of Seaborn data visualisations.

Export Fitbit data archive

Firstly, I needed to get hold of all of my historic fitness data. Fitbit makes it fairly easy to export your Fitbit data for the lifetime of your account by following the instructions at export your account archive.

Instructions for using the export Fitbit data archive – Screenshot by the author.

You'll need to confirm your request … and be patient. My archive took over three days to create – but I finally received an email with instructions to download a ZIP file containing my Fitbit data. This file should contain all of the personal fitness activity recorded by my Fitbit or associated services. Unzipping the archive reveals a huge collection of files – mine, for example, contained 7,921 files once I unzipped the 79MB archive.

A small sample of the thousands of nested files – Screenshot by the author.

Let's start looking at the variety of data available in the archive.

Why DuckDB?

There are many great blogs (1,2,3) describing DuckDB – the TL;DR summary is that DuckDB is an open-source, in-process OLAP database built specifically for analytical queries. It runs locally, has extensive SQL support, and can run queries directly on Pandas DataFrames, Parquet files, and JSON files. Extra points for its seamless integration with Python and R. The fact that it's insanely fast and does (mostly) all processing in memory makes it a good choice for building my personal data warehouse.

Fitbit activity data

The first collection of files I looked at was activity data. Physical activity and broad exercise information appear to be stored in numbered files such as Physical Activity/exercise-1700.json.

I couldn't work out what the file numbering actually means; my guess is that they are just increasing integers for the collection of exercise files. In my data export the earliest files started at 0 and ran to file number 1700 over a six-year period. Inside each file is an array of records, each with a description of an activity. The record structure seems to change depending on the activity – here is an example of a "Walk":

{
  "activityName" : "Walk",
  "averageHeartRate" : 79,
  "calories" : 122,
  "duration" : 1280000,
  "steps" : 1548,
  "startTime" : "01/06/23 01:08:57",
  "elevationGain" : 67.056,
  "hasGps" : false,
  ...
  "activityLevel" : [
    { "minutes" : 1, "name" : "sedentary" },
    { "minutes" : 2, "name" : "lightly" },
    { "minutes" : 6, "name" : "fairly" },
    { "minutes" : 6, "name" : "very" }
  ]
}

This physical activity data is just one of the 7,921 files now on my laptop. Fortunately, DuckDB can read (and auto-detect the schema of) JSON files using the read_json function, allowing me to load all of the exercise files into a physical_activity table using a single SQL statement. It's worth noting that I needed to specify the date format mask, as the Fitbit export uses a very American-style date format.

Tags: Data Analysis, Data Engineering, DuckDB, Fitbit, JSON
