Expanding Time
While low-dimensional datasets may seem of limited use, there are often ways to extract more features from them, especially if the dataset includes time data. Extracting additional features by "unpacking" a date and time value can provide insights not readily available in the base dataset. This article walks through the process of using Python to take low-dimensional weather data further than its original features might suggest.
Data
The data for this article is public domain weather data used with permission of the Montana Climate Office at climate.umt.edu [1]. Montana's weather data is accessible at: https://shiny.cfc.umt.edu/mesonet-download/ [2]. This article uses daily air temperature recordings from two weather stations: Whitefish North, MT and Harding Cutoff, SD. Note that the station names and air temperature column have been slightly reformatted to make them easier to follow.
Code and Data CSV:
The full notebook and data CSV, along with additional visualizations, are available at the linked GitHub page: download or clone it from git to follow along.
This code requires the following libraries:
# Data Handling
import numpy as np
import pandas as pd
# Data visualization Libraries
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
pio.renderers.default='notebook'
1. Initial Data Exploration
Here's what the base dataframe looks like:
We can run a quick Plotly visualization with the following code, which shows temperature recordings over time for both weather stations:
# Temperature Patterns - Weather Stations:
plot = px.scatter(df, x='datetime',
                  y='AirTempCelsius',
                  color='station_key')
plot.update_layout(
    # <br> starts a new line inside a Plotly title
    title={'text': "Temperature Recordings over Time<br>"
                   "Whitefish N, MT and Harding Cutoff, SD Weather Stations",
           'xanchor': 'left',
           'yanchor': 'top',
           'x': 0.1},
    xaxis_title='',
    yaxis_title='Temperature in Celsius',
    legend_title_text='Weather Station:')
plot.show()
This is a good start that shows the impact of seasonality on observed temperatures; the dips are winter months and the peaks are summer. Other possibilities include a histogram of observed temperatures:
# Generate plot:
plot = px.histogram(df, x='AirTempCelsius',
                    color='station_key', barmode='overlay')
plot.update_layout(
    title={'text': "Number of Occurrences of Each Temperature (Celsius)<br>"
                   "Whitefish N, MT and Harding Cutoff, SD Weather Stations",
           'xanchor': 'left',
           'yanchor': 'top',
           'x': 0.1},
    xaxis_title='Temperature in Celsius',
    yaxis_title='Count',
    legend_title_text='Weather Station:')
plot.show()
More charts are possible, as is a basic statistical analysis on observed temperatures overall and by station. However, taking this data exploration to the next level requires additional feature extraction.
2. Expanding the Time Column
Let's revisit the "datetime" column. Looking at a random date example, this column contains date information in the following format: 2019-04-22, or year, month, day. There's quite a bit of data packed within this one column. Here's a visual on how it breaks out:
Splitting apart the "datetime" column can create 7 new columns in the dataframe, but there are more possibilities tailorable to the specific dataset and domain. For example, the day could break down into data on a weekend versus weekday or a holiday. Such information could be useful for a retail-focused dataset where an analyst is interested in customer behavior over time. Since the data used in this example is weather, season is of importance.
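As a rough illustration of the weekend-versus-weekday idea, the sketch below flags weekend dates. The sample dates and the "IsWeekend" column name are assumptions made for this example and are not part of the weather dataset.

```python
import pandas as pd

# Hypothetical sample dates, used only for illustration.
sample = pd.DataFrame({
    'datetime': pd.to_datetime(['2019-04-19', '2019-04-20',
                                '2019-04-21', '2019-04-22'])
})

# dt.dayofweek numbers Monday as 0 and Sunday as 6,
# so values of 5 and 6 fall on the weekend.
sample['IsWeekend'] = sample['datetime'].dt.dayofweek >= 5

print(sample)
```

A holiday flag would need an external list of holiday dates, which depends on the country and domain in question.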
Accessing all of this is rather simple via Pandas' datetime capabilities. The first step is ensuring the "datetime" column is formatted as such with the following code:
# Set datetime column as pandas datetime data:
df['datetime'] = pd.to_datetime(df['datetime'])
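When the source file might contain malformed dates, one possible precaution is to state the expected format and coerce failures to missing values so they can be counted. The sketch below assumes a year-month-day text format and uses made-up input strings.

```python
import pandas as pd

# Made-up raw values; the last one is deliberately not a valid date.
raw = pd.Series(['2019-04-22', '2019-04-23', 'not a date'])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Count the rows that failed to parse before going further.
print("Unparseable rows:", parsed.isna().sum())
```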
There are a few ways to extract additional features from the "datetime" column:
- Directly via use of Pandas dt.
- By using Pandas dt to break the additional features into new columns.
2.1. Directly Accessing Elements from the Date and Time Column
Here's an example of directly accessing and converting the time data:
print("Dataframe 'datetime' Column Value:", df['datetime'][0])
print("Extracting Day Number:", df['datetime'].dt.day[0])
print("Extracting Day Name:", df['datetime'].dt.day_name()[0])
The output is:
Here's an example for months:
print("Dataframe 'datetime' Column Value:", df['datetime'][0])
print("Extracting Month Number:", df['datetime'].dt.month[0])
print("Extracting Month Name:", df['datetime'].dt.month_name()[0])
The output here is:
2.2. Creating New Columns for Elements Extracted from Date and Time Column
Another way to handle the time data is to extract specific datetime features into new columns. This expands the dimensionality of the dataframe. The following Python code shows how to create new columns for the day name, the numerical day, the month name, the numerical month, and the year. The last line creates a column for the Julian Date, used here in the sense of a sequential count of days in a year (also called the ordinal date or day of year): the first day of the year is 001 and the last is 365 (or, in a leap year, 366). An example Julian calendar is available at this link [3].
# Day name column (example: Sunday):
df['DayName'] = df['datetime'].dt.day_name()
# Day number column:
df['Day'] = df['datetime'].dt.day
# Month name column (example: February):
df['MonthName'] = df['datetime'].dt.month_name()
# Month number column:
df['Month'] = df['datetime'].dt.month
# Year:
df['Year'] = df['datetime'].dt.year
# Julian date:
df['JulianDate'] = df['datetime'].dt.strftime('%j')
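One detail worth knowing: strftime('%j') returns zero-padded text, while Pandas also offers dt.dayofyear for the same count as an integer. The short sketch below compares the two on a few sample dates chosen for illustration; which form is preferable depends on whether the column will be used for labels or for arithmetic.

```python
import pandas as pd

# Sample dates from a leap year, chosen only for illustration.
dates = pd.Series(pd.to_datetime(['2020-01-01', '2020-03-01', '2020-12-31']))

# Text form: zero-padded strings that sort correctly as text.
as_text = dates.dt.strftime('%j')

# Integer form: convenient for arithmetic and numeric axes.
as_int = dates.dt.dayofyear

print(as_text.tolist())
print(as_int.tolist())
```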
There's one thing left to do, and that's create a season column. This is a good example of where knowing the data, the customer, and the relevant domain is important. In the case of seasons, there are two calendar definitions: Meteorological and Astronomical [4]. Meteorologists define each season as a three-month span beginning on the 1st of a month, whereas the Astronomical Seasons begin on calendar dates that do not coincide with the start of a month.
In this example, analyzing weather data mandates use of the Meteorological calendar. However, suppose the dataset in question deals with consumer purchases over time. In that case, Astronomical Seasons might make more sense. Here's an example of an if-else list comprehension for creating a season column based on the Meteorological calendar:
# Classify seasons:
df['Season'] = ['Winter' if x == 'December' else
'Winter' if x == 'January' else
'Winter' if x == 'February' else
'Spring' if x == 'March' else
'Spring' if x == 'April' else
'Spring' if x == 'May' else
'Summer' if x == 'June' else
'Summer' if x == 'July' else
'Summer' if x == 'August' else
'Fall' if x == 'September' else
'Fall' if x == 'October' else
'Fall' if x == 'November' else
'NaN' for x in df['MonthName']]
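A more compact alternative, offered as a sketch rather than as the article's method, is to map month numbers to seasons with a dictionary. The SEASON_BY_MONTH name and the sample dates are assumptions for this example; the season boundaries follow the same Meteorological definition used above.

```python
import pandas as pd

# Meteorological seasons keyed by month number (hypothetical helper name).
SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall',
}

# Sample dates, one per season, for illustration only.
sample = pd.DataFrame({
    'datetime': pd.to_datetime(['2019-01-15', '2019-04-22',
                                '2019-07-04', '2019-10-31'])
})

# Series.map looks up each month number in the dictionary.
sample['Season'] = sample['datetime'].dt.month.map(SEASON_BY_MONTH)

print(sample)
```

Unlike the list comprehension, map leaves a true missing value (rather than the text 'NaN') for any month that is not in the dictionary.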
This results in the following dataframe:
While 10 columns may still qualify as low-dimensionality for a dataframe, it is quite a change from the starting point of 3. But what do these new time features enable? The next section shows some possibilities.
3. Making Use of the New Time Features
Let's revisit the original plot of temperature over time from section 1. Here's an example code block visualizing one station (Whitefish North) over time with seasons represented by different colors:
# Show one station with seasons plotted:
plot = px.scatter(df[df['station_key'] == 'Whitefish N'],
                  x='datetime', y='AirTempCelsius', color='Season',
                  color_discrete_sequence=["#3366cc", "#109618", "#d62728",
                                           "#ff9900"])
plot.update_layout(
    title={'text': "Temperature Patterns by Season<br>"
                   "Data from Whitefish N, MT Weather Station",
           'xanchor': 'left',
           'yanchor': 'top',
           'x': 0.1},
    xaxis_title='',
    yaxis_title='Temperature in Celsius',
    legend_title_text='Season:')
plot.show()
The resulting chart is:
The new time features quickly show how seasons map to shifts in observed temperature. The addition of the seasons column is already proving useful, but revisiting the histogram from section 1 is even more interesting. The updated code below assigns the season as the value for facet_row in the plotly express histogram:
# Generate plot:
plot = px.histogram(df, x='AirTempCelsius', color='station_key',
                    barmode='overlay', facet_row='Season')
plot.update_layout(
    title={'text': "Temperature Recordings, 2019 to 2022<br>"
                   "Whitefish N, MT and Harding Cutoff, SD Weather Stations",
           'xanchor': 'left',
           'yanchor': 'top',
           'x': 0.1},
    legend_title_text='Weather Station:',
    xaxis_title='Recorded Temperature')
plot.update_yaxes(title="")
# Shorten facet labels from "Season=Winter" to "Winter":
plot.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
plot.show()
The result is:
Introducing new time features to the dataframe enhances the ability to compare the temperature distributions of the two weather stations over time. In this case, a noticeable divergence between the two stations occurs in Summer.
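The same comparison can be put into numbers by grouping on the new columns. The sketch below uses made-up temperature readings purely to show the mechanics; the values are not from the Mesonet data, and 'Harding Cutoff' is an assumed spelling of the second station key.

```python
import pandas as pd

# Made-up readings for illustration only.
sample = pd.DataFrame({
    'station_key': ['Whitefish N', 'Whitefish N',
                    'Harding Cutoff', 'Harding Cutoff'],
    'Season': ['Summer', 'Summer', 'Summer', 'Summer'],
    'AirTempCelsius': [16.0, 18.0, 22.0, 24.0],
})

# Mean, minimum, and maximum temperature per season and station.
summary = (sample.groupby(['Season', 'station_key'])['AirTempCelsius']
                 .agg(['mean', 'min', 'max']))

print(summary)
```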
Here's another example – suppose a meteorologist was interested in comparing the two weather stations in Summer 2020 by Julian date. Here's how to visualize the temperature recordings using the new features:
# Prep Data (filter and sort in one step to avoid modifying a slice in place):
df1 = df[df['Year'] == 2020].sort_values(by='JulianDate')
# Generate plot:
plot = px.line(df1[df1['Season'] == 'Summer'],
               y="AirTempCelsius", x="JulianDate", color="station_key",
               color_discrete_sequence=["#3366cc", "#d62728"])
plot.update_layout(
    title={'text': "Summer Temperature Recordings, 2020<br>"
                   "Whitefish N, MT and Harding Cutoff, SD Weather Stations",
           'xanchor': 'left',
           'yanchor': 'top',
           'x': 0.1},
    legend_title_text='Weather Station:',
    xaxis_title='Julian Date',
    yaxis_title='Temperature in Degrees Celsius')
plot.show()
The graph looks like this:
The additional time features allow analysts to quickly answer narrowly scoped questions; note how the Harding Cutoff station's Summer 2020 temperatures were typically higher than Whitefish N's until an anomalous crossover occurred in the latter half of the season.
3.1. Directly Using the Original Date and Time Column
Recall in section 2 we discussed directly accessing additional date and time features versus extracting them into new columns. The above graph, "Summer Temperature Recordings, 2020," is reproducible by using Pandas dt functions on the original "datetime" column inside the plotly chart code:
# Generate Plot:
plot = px.line(df[(df.datetime.dt.year == 2020) &
                  ((df.datetime.dt.month == 6) |
                   (df.datetime.dt.month == 7) |
                   (df.datetime.dt.month == 8))],
               x=df[(df.datetime.dt.year == 2020) &
                    ((df.datetime.dt.month == 6) |
                     (df.datetime.dt.month == 7) |
                     (df.datetime.dt.month == 8))].datetime.dt.strftime('%j'),
               y=df[(df.datetime.dt.year == 2020) &
                    ((df.datetime.dt.month == 6) |
                     (df.datetime.dt.month == 7) |
                     (df.datetime.dt.month == 8))].AirTempCelsius,
               color="station_key",
               color_discrete_sequence=["#d62728", "#3366cc"])
plot.update_layout(
    title={'text': "Summer Temperature Recordings, 2020<br>"
                   "Whitefish N, MT and Harding Cutoff, SD Weather Stations",
           'xanchor': 'left',
           'yanchor': 'top',
           'x': 0.1},
    legend_title_text='Weather Station:',
    xaxis_title='Julian Date',
    yaxis_title='Temperature in Degrees Celsius')
plot.show()
This results in the same exact graph:
The advantage of this technique is that it does not require increasing the dataframe's dimensionality, which, for very large datasets, can help reduce computational load and prevent the dataframe from becoming too large to work with. This technique also works well for a narrow analysis question requiring limited use of additional time features.
The disadvantage is the large amount of code required to format and extract specific features within a visualization function. This can hinder the interpretability and repeatability of the code. There may also be functions or code that are incompatible with in-line Pandas dt operations.
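One way to soften that disadvantage, sketched here under the assumption that the filter is needed more than once, is to build the boolean mask a single time and reuse it. The variable names and sample rows below are hypothetical and exist only to show the pattern.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'datetime': pd.to_datetime(['2020-05-31', '2020-06-01', '2020-08-31',
                                '2020-09-01', '2021-07-01']),
    'AirTempCelsius': [12.0, 15.0, 21.0, 17.0, 23.0],
})

# Build the mask once; isin replaces the chain of == comparisons.
is_summer_2020 = ((sample['datetime'].dt.year == 2020) &
                  sample['datetime'].dt.month.isin([6, 7, 8]))

# The filtered frame can then feed the x, y, and data arguments of a plot.
summer_2020 = sample[is_summer_2020]

print(summer_2020)
```

This keeps the dataframe at its original width while avoiding three copies of the same filter expression.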
4. Conclusion
Time columns in dataframes often contain numerous latent features that can improve the analytical and visual output possibilities for low-dimensional data. Extracting these time features can increase a dataframe's dimensionality in a way that unlocks new, useful possibilities for analysis.
For more example visualizations and the complete code, the Jupyter notebook and CSV file are available at this linked GitHub page.
References:
[1] University of Montana, Montana Climate Office (2023).
[2] Montana Mesonet Data, Montana Mesonet Data Downloader (2023).
[3] NOAA Great Lakes Environmental Research Laboratory – Ann Arbor, MI, USA, Julian Date Calendar (2023).
[4] NOAA, Meteorological Versus Astronomical Seasons (2016).