Solve the Mystery of the Serrated COVID Chart

Author: Murphy
A computer monitor displaying a chart with a jagged blue line (by author and Leonardo AI)

In the first year of the Covid-19 pandemic, the mortality toll of the disease was the subject of much controversy. Among the issues were early underestimation due to a lack of testing, mortalities going unrecorded outside of hospitals, and distinguishing deaths of COVID-19 from deaths with COVID-19 [1][2].

On top of everything, and to everyone's great misfortune, the pandemic quickly became politicized. Partisan pundits leaped on every piece of data, looking for ways they could twist it to their advantage. Confirmation bias ran rampant. If you were on social media at the time, you probably saw posts that challenged the veracity of official charts and graphs.

In this Quick Success Data Science project, we'll look at a particular chart that showed up on my Facebook wall at the time. The chart records US COVID-19 mortalities for the first year of the pandemic, and it displays a distinctly serrated or "sawtooth" nature.

US COVID-19 mortalities for the first year of the pandemic (by author from "The COVID Tracking Project" at The Atlantic [3])

The curve oscillations have a high frequency, and it's doubtful that the disease progressed in this manner. While some considered this proof that COVID mortality counts were clearly wrong and could not be trusted, those of us blessed with data science skills quickly made short work of this overblown mystery.

The Dataset

The data we'll use was collected as part of "The COVID Tracking Project" at The Atlantic [3]. It includes COVID-19 statistics from March 3, 2020, to March 7, 2021. To reduce the size of the dataset, I've downloaded the data for just the state of Texas and saved it as a CSV file in this Gist.

You can find the original dataset at [covidtracking.com](https://covidtracking.com) and the license for the data [here](https://covidtracking.com/about-data/license).

Installing Libraries

Besides Python, we'll need the pandas library. Because pandas relies on Matplotlib for plotting, make sure it's installed as well. You can install both using either:

conda install pandas matplotlib

or

pip install pandas matplotlib

A nice thing about pandas is that it ships with built-in functionality for making plots and working with time series, which are data points indexed in chronological order. Both Python and pandas treat dates and times as special objects that are "aware" of the mechanics of the Gregorian calendar, the sexagesimal (base 60) time system, time zones, daylight-saving time, leap years, and more.

Native Python supports time series through its [datetime](https://docs.python.org/3/library/datetime.html) module. Pandas' datetime capability is based on the NumPy datetime64 and timedelta64 data types. By converting "string" dates into "real" dates, we can do useful things like extract the days of the week or average the data over weeks or months.
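As a quick illustration of what "real" dates buy us, here's a minimal sketch (the sample dates are arbitrary, chosen only for demonstration):

```python
import pandas as pd

# Convert "string" dates into "real" (datetime64) dates:
dates = pd.to_datetime(['2020-03-07', '2020-03-08', '2020-03-09'])
print(dates.dtype)  # datetime64[ns]

# Once converted, calendar-aware attributes become available:
print(dates.day_name().tolist())  # ['Saturday', 'Sunday', 'Monday']
```

This calendar awareness is exactly what we'll lean on later to extract weekdays and resample by week.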

The Code

The following code was written in JupyterLab and is described by cell.

Importing Libraries and Loading the Data

After importing pandas, we load the CSV file into a DataFrame, keeping only the columns for "date" and "mortalities." We then convert the "date" column to datetime using pandas' to_datetime() method, sort it, and set it as the DataFrame's index.

import pandas as pd

df = pd.read_csv('https://bit.ly/3ZgrmW0', 
                 usecols=['date', 'mortalities'])
df.date = pd.to_datetime(df.date)
df = df.sort_values('date')
df = df.set_index('date')
df.tail()

Plotting the Initial Data

Pandas' plotting capability is limited but it's fine for data exploration and "quick look" analyses.

df.plot();

# Optional code to save the figure:
# fig = df.plot().get_figure();
# fig.savefig('file_name.png',  bbox_inches='tight', dpi=600)
The initial Texas dataset (image by the author)

This chart presents the same "choppiness" as the national data. It also includes a sharp spike around the end of July. Let's look at this spike before investigating the oscillations.

Handling the Spike

Because the spike is clearly the maximum value in the series, we can easily retrieve its value and corresponding date index using the max() and idxmax() methods, respectively.

print(f"Max. Value: {df['mortalities'].max()}")
print(f"Date: {df['mortalities'].idxmax()}")

This is most likely an anomalous value, especially given that the Centers for Disease Control and Prevention (CDC) recorded only 239 deaths on this date, a figure more consistent with the adjacent data. Let's use the CDC value going forward. To change the DataFrame, we'll apply the .loc indexer and pass it the date and column name.

# Set aberrant spike at 2020-7-27 to CDC value of 239 deaths:
df.loc['2020-7-27', 'mortalities'] = 239  
df.plot();
The Texas data with the spike repaired (image by the author)

That looks better. Now let's evaluate the serrated nature of the curve. Visually, there appear to be about 4–5 oscillations every month, suggesting a weekly frequency.
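One quick way to test the weekly-frequency hunch is an autocorrelation check: a daily series with a 7-day cycle will correlate strongly with itself shifted by 7 days. Here's a sketch on a synthetic stand-in series (the real check would be df['mortalities'].autocorr(lag=7) on our DataFrame):

```python
import numpy as np
import pandas as pd

# Toy daily series with a weekly dip, standing in for the mortality data:
rng = pd.date_range('2020-03-03', periods=70, freq='D')
counts = pd.Series(100 + 30 * np.where(rng.dayofweek >= 5, -1, 1),
                   index=rng)

# A strong autocorrelation at a 7-day lag confirms a weekly cycle:
print(counts.autocorr(lag=7))  # 1.0 for this perfectly periodic toy data
print(counts.autocorr(lag=3))  # much weaker (negative here)
```

On real, noisy data the lag-7 value won't be a perfect 1.0, but a pronounced peak at lag 7 still points to a weekly rhythm.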

Checking the Weekly Data

To investigate this in more detail, we'll plot a random subset of the data by weekday. We'll start by making a copy of our DataFrame, named "df_weekdays," and add a column for weekdays. We'll then plot about two weeks of data, over the (arbitrary) index range 103–120.

# Examine values by weekday:
df_weekdays = df.copy()
df_weekdays['weekdays'] = df.index.day_name()

df_weekdays.iloc[103:120].plot(figsize=(10, 5), rot=90, x='weekdays');
A subset of the Texas data plotted by weekday (image by the author)

The counts appear to drop during and after the weekends. Let's investigate this further using a tabular format. We'll look at a 3-week interval and highlight Mondays in the printout.

# Highlight Mondays in the DataFrame printout:
df_weekdays = df_weekdays.iloc[90:115]
df_weekdays.style.apply(lambda x: ['background: lightgrey' 
                                   if x.weekdays == 'Monday'
                                   else '' for i in x], axis=1)

The lowest reported number of deaths consistently occurs on a Monday, and the Sunday results also appear suppressed. This suggests a reporting issue over the weekend.
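We could also quantify this day-of-week effect by grouping the daily counts by weekday name and comparing the averages. Here's a minimal sketch on a synthetic stand-in DataFrame (on the real data, the groupby line is the same):

```python
import pandas as pd

# Toy daily data with suppressed weekend counts, standing in for df:
rng = pd.date_range('2020-06-01', periods=28, freq='D')
df = pd.DataFrame({'mortalities': [120 if d < 5 else 60
                                   for d in rng.dayofweek]}, index=rng)

# Average the daily counts by weekday name:
by_day = df['mortalities'].groupby(df.index.day_name()).mean()
print(by_day.sort_values())  # the suppressed days land at the bottom
```

Sorting the result makes the low-reporting days stand out at a glance, without any plotting.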

Indeed, this issue was confirmed by researchers at the Albert Einstein College of Medicine and Johns Hopkins School of Public Health [4]. They found that the oscillations only appear in datasets where the date of mortality reflects the reporting date. In datasets backdated to the episode date, the oscillations are absent.

Downsampling from Days to Weeks

Because of the reporting issues, the proper resolution for this data is weekly, rather than daily. In order to plot the data with a weekly interval, we need to downsample from a higher frequency to a lower frequency, using pandas' resample() method. Because multiple samples must be combined into one, the resample() method is usually chained to a method for aggregating the data, as listed in the following figure.

Useful aggregation methods in pandas (from "Python Tools for Scientists" [5])
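To see how this chaining works, here's a minimal sketch on a toy two-week daily series (the dates and values are arbitrary):

```python
import pandas as pd

# Toy daily series: 14 days starting on a Monday, values 1 through 14:
rng = pd.date_range('2020-03-02', periods=14, freq='D')
daily = pd.Series(range(1, 15), index=rng)

# Downsample to weekly bins, chaining an aggregation method:
weekly_sum = daily.resample('W').sum()    # total per week
weekly_mean = daily.resample('W').mean()  # average per week
print(weekly_sum)
```

Each weekly bin absorbs seven daily values, so any day-of-week pattern within a week is folded into a single number.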

Again, because the reporting issue affects daily counts but is corrected on a weekly basis, downsampling the data from daily to weekly should merge the low and high reports and smooth the curve. We'll do this by passing 'W' (for weekly) to the resample() method and chaining the sum() aggregation method to total the daily values. Besides 'W', other useful time series frequencies are shown in the table below.

Useful pandas time series frequencies (from "Python Tools for Scientists" [5])

# Resample weekly to remove serrations:
df.resample('W').sum().plot(grid=True);
The Texas data resampled to a weekly frequency (image by the author)

With the reporting bias "folded" into the new downsampled time series, the curve looks smoother, as we would expect.

Summary

In this project, we applied data science techniques to explain the strange, serrated nature of a historic COVID-19 mortality chart. These oscillations were used by some to question the veracity of the mortality data.

We found that the oscillations reflect the weekly tempo of reporting, pointing to a day-of-week bias in reporting practices. Operational explanations like this should be considered before suggesting other mechanisms, such as the quality of hospital care over weekends or government tampering with reports.

We completed this project using pandas, Python's primary data analysis package. Pandas is ideal for "quick look" analyses like this. Besides its spreadsheet-like capabilities, it includes built-in tools for plotting and for working with dates.

The process we used was simple, but that's the point. It can take more effort to weave a conspiracy theory than to debunk one. My experiences in the corporate world were similar. We would sometimes argue for 3 weeks over whether to perform a task that would take 3 hours to complete if someone just sat down and did it!

Thanks!

Thanks for reading and follow me for more Quick Success Data Science projects in the future.

Citations

  1. Lang, Katherine, March 11, 2022, Are we overcounting COVID-19 deaths? (medicalnewstoday.com).
  2. Fichera, Angelo, April 2, 2021, Flawed Report Fuels Erroneous Claims About COVID-19 Death Toll – FactCheck.org
  3. The COVID Tracking Project at The Atlantic
  4. Bergman, A., Sella, Y., Agre, P., & Casadevall, A. (2020), "Oscillations in U.S. COVID-19 Incidence and Mortality Data Reflect Diagnostic and Reporting Factors," MSystems, 5(4), https://doi.org/10.1128/mSystems.00544-20
  5. Vaughan, Lee, 2023, Python Tools for Scientists: An Introduction to Using Anaconda, JupyterLab, and Python's Scientific Libraries, No Starch Press, San Francisco.

Tags: Conspiracy Theories Covid-19 Data Literacy Downsampling Python Programming
