Time Series for Climate Change: Forecasting Extreme Weather Events


This is Part 5 of the series Time Series for Climate Change.


Extreme Weather Events

Climate change has contributed to an increasing number of extreme weather events. These events pose a significant risk to human lives and infrastructure.

Number of hail and thunderstorm wind events by year across the USA. Data source. Image by author.

Extreme weather events can result in human fatalities and cost up to billions of dollars. The financial impact is due to the destruction of infrastructure or agricultural resources. These events have a lasting impact on the socio-economic development of the affected region.

Estimated costs of various extreme weather events by year. Data source. Image by author.

Extreme weather events span several phenomena, including hurricanes, tornadoes, floods, droughts, hail, and tropical storms. The figure below shows the 20 most common event types in the USA since 1950.

20 most common extreme weather event types across the USA since 1950. Data source. Image by author.

The need for accurate forecasts

Accurate forecasts of extreme weather events play a key role in our adaptation to the impacts of climate change.

Issuing timely alarms enables people to evacuate or protect themselves and their assets. Response teams can mobilize resources more efficiently, reducing the impact of the events.

Yet, forecasting extreme weather events is a difficult task.

One problem concerns access to data. Extreme weather events result from the interaction of many factors across the atmosphere, the ocean, and land. These factors can be difficult to model, or may not be readily available. Moreover, extreme weather events are rare by definition. Machine learning models tend to struggle with data sets that have an imbalanced distribution – that is, when one of the classes is rare.
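To see why this matters, consider a minimal illustration with simulated labels: a trivial model that always predicts "no event" achieves high accuracy on a rare-event data set while never detecting a single event. The 2% event rate below is purely illustrative.

import numpy as np

# simulated labels: extreme events occur in 2% of time steps (illustrative rate)
rng = np.random.default_rng(0)
y = rng.binomial(n=1, p=0.02, size=10_000)

# a trivial model that always predicts "no event"
always_negative = np.zeros_like(y)

# ~98% accuracy, yet the model never detects a single event
accuracy = (always_negative == y).mean()
print(f'Accuracy of the useless model: {accuracy:.2%}')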

Another difficulty arises because weather conditions can change rapidly. This means there's a lot of uncertainty about the future. Forecasting models must convey uncertainty for effective communication. This aspect is important for policymakers and the public.

Tackling these challenges is a key step in mitigating the impacts of climate change.


Hands-on: Forecasting Extreme Weather Events in Florida

In the rest of this article, we'll build a model to forecast hail and thunderstorm events in Florida, USA. Hail is a form of ice precipitation produced by thunderstorms.

The complete code used in this tutorial is available on GitHub.

Data set

We use a data set collected by NOAA that describes storm events that occurred in the USA since 1950 [1]. The data includes information such as:

  • event type (e.g. hail, tornado);
  • location (coordinates and respective state);
  • date and time;
  • estimated costs.

Here's a sample of the data about storm events in Florida:

Sample of the storm events data set. Image by author.

Storm detection and tracking are usually done with approaches based on remote sensing. But we can also model extreme weather events using meteorological data. Here, we'll use data captured by smart buoys to predict impending hail or thunderstorm events. This data is also collected by NOAA [2].

Several smart buoys are placed off the coast of Florida. Screenshot taken from the NOAA National Data Buoy Center website [2].

Reading the data

We can read the buoy data directly from NOAA as follows:

import re

import urllib3
import numpy as np
import pandas as pd

# Subset of stations on the coast of Florida
STATION_LIST = ['41009', 'SPGF1', 'VENF1',
                '42036', 'SAUF1', 'FWYF1',
                'LONF1', 'SMKF1']

# URL template
URL = 'https://www.ndbc.noaa.gov/view_text_file.php?filename={station_id}h{year}.txt.gz&dir=data/historical/stdmet/'

def read_buoy_remote(station_id: str, year: int):
    TIME_COLUMNS = ['YYYY', 'MM', 'DD', 'hh']

    # formatting the URL
    file_url = URL.format(station_id=station_id.lower(), year=year)

    http = urllib3.PoolManager()

    # get request
    response = http.request('GET', file_url)

    # decoding
    lines = response.data.decode().split('\n')

    # lots of data cleaning below
    data_list = []
    for line in lines:
        line = re.sub(r'\s+', ' ', line).strip()
        if line == '':
            continue
        line_data = line.split(' ')
        data_list.append(line_data)

    # first row holds the column names; second row holds the measurement units
    df = pd.DataFrame(data_list[2:], columns=data_list[0]).astype(float)
    # 99.0 and 999.0 encode missing values
    df[(df == 99.0) | (df == 999.0)] = np.nan

    if 'BAR' in df.columns:
        df = df.rename({'BAR': 'PRES'}, axis=1)

    if '#YY' in df.columns:
        df = df.rename({'#YY': 'YYYY'}, axis=1)

    if 'mm' in df.columns:
        TIME_COLUMNS += ['mm']

    df[TIME_COLUMNS] = df[TIME_COLUMNS].astype(int)

    if 'mm' in df.columns:
        df['datetime'] = \
            pd.to_datetime([f'{year}/{month}/{day} {hour}:{minute}'
                            for year, month, day, hour, minute in zip(df['YYYY'],
                                                                      df['MM'],
                                                                      df['DD'],
                                                                      df['hh'],
                                                                      df['mm'])])

    else:
        df['datetime'] = \
            pd.to_datetime([f'{year}/{month}/{day} {hour}:00'
                            for year, month, day, hour in zip(df['YYYY'],
                                                              df['MM'],
                                                              df['DD'],
                                                              df['hh'])])

    df = df.drop(TIME_COLUMNS, axis=1)

    df.set_index('datetime', inplace=True)

    return df
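With this function in place, we can download and stack the data for all stations over a range of years. Here's a usage sketch (the year range is illustrative, and unavailable station/year files are simply skipped):

# downloading the data for each station and year
buoy_data = []
for station in STATION_LIST:
    for yr in range(2010, 2022):  # illustrative year range
        try:
            station_df = read_buoy_remote(station_id=station, year=yr)
        except Exception:
            # some station/year combinations are not available
            continue

        station_df['STATION'] = station
        buoy_data.append(station_df)

# combining the data from all buoys (later stored as buoys.csv)
buoys = pd.concat(buoy_data)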

We collected the following variables from 8 buoys:

  • Wind speed;
  • Wave height;
  • Atmospheric pressure;
  • Water temperature;
  • Average wave period.
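In the NDBC files, these variables correspond to the following standard meteorological columns (units per the NDBC documentation):

# NDBC column codes for the selected variables
BUOY_VARIABLES = {
    'WSPD': 'wind speed (m/s)',
    'WVHT': 'significant wave height (m)',
    'PRES': 'sea level pressure (hPa)',
    'WTMP': 'water temperature (°C)',
    'APD': 'average wave period (s)',
}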

Here's a sample of this data from one of the buoys:

Sample of meteorological data from a smart buoy. Image by author.

Building the data set

We need to merge the two data sources to create a data set for training a forecasting model.

At each time step, we get the recent past values of meteorological data from the buoys. These are used as explanatory variables. The target variable is a binary variable indicating whether a hail or thunderstorm event occurs in the next 12 hours.

import numpy as np
import pandas as pd

# config is a local module from the project repository
from config import ASSETS, OUTPUTS

PART = 'Part 5'
assets = ASSETS[PART]

# focusing on hail and thunderstorm events
TARGET_EVENTS = ['Hail', 'Thunderstorm Wind']
# wind speed, wave height, pressure, water temp, avg wave period
METEOROLOGICAL_DATA = ['WSPD', 'WVHT', 'PRES', 'WTMP', 'APD']

# using past 4 hours as explanatory variables
N_LAGS = 4
# forecasting events in the next 12 hours
HORIZON = 12

# loading storm events data
storms = pd.read_csv(f'{assets}/storms_data.csv', index_col='storm_start')
# localizing to UTC so the storm timestamps are comparable with the buoy data
# (assuming the storm timestamps are recorded in UTC)
storms.index = pd.to_datetime(storms.index).tz_localize('UTC')
hail_df = storms.loc[storms['EVENT_TYPE'].isin(TARGET_EVENTS), :]

# loading the meteorological data
buoys = pd.read_csv(f'{assets}/buoys.csv', index_col='datetime')
buoys.index = pd.to_datetime(buoys.index).tz_localize('UTC')
buoys['STATION'] = buoys['STATION'].astype(str)
# resampling the data to an hourly granularity
buoys_h = buoys.groupby('STATION').resample('H').mean()
# subsetting the variables
buoys_h = buoys_h[METEOROLOGICAL_DATA]
buoys_df = buoys_h.reset_index('STATION')

# getting all unique time steps, in chronological order
base_index = buoys_df.index.unique().sort_values()

# getting list of stations
station_list = buoys_df['STATION'].unique().tolist()

X, y = [], []
# iterating over each time step
for dt in base_index[N_LAGS + 1:]:

    features_by_station = []
    # iterating over each buoy station
    for station_id in station_list:

        # subsetting the data by station and time step (last n_lags observations)
        station_df = buoys_df.loc[buoys_df['STATION'] == station_id]
        station_df = station_df.drop('STATION', axis=1)
        station_df_i = station_df[:dt].tail(N_LAGS)

        if station_df_i.shape[0] < N_LAGS:
            break

        # transforming lags into features
        station_timestep_values = []
        for col in station_df_i:
            series = station_df_i[col]
            series.index = [f'{station_id}({series.name})-{i}'
                            for i in list(range(N_LAGS, 0, -1))]

            station_timestep_values.append(series)

        station_values = pd.concat(station_timestep_values, axis=0)

        features_by_station.append(station_values)

    # skipping time steps where any station lacks enough observations
    if len(features_by_station) < len(station_list):
        continue

    # combining features from all stations
    feature_set_i = pd.concat(features_by_station, axis=0)

    X.append(feature_set_i)

    # determining the target variable:
    # whether an extreme weather event occurs in the next HORIZON hours
    td = (hail_df.index - dt)
    td_hours = td / np.timedelta64(1, 'h')
    any_event_within = pd.Series(td_hours).between(0, HORIZON)

    y.append(any_event_within.any())

# combining all data points
X = pd.concat(X, axis=1).T
y = pd.Series(y).astype(int)
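Since these events are rare, it's worth checking how imbalanced the resulting target variable is (a quick sanity check):

# proportion of positive cases
# (a hail or thunderstorm event within the next 12 hours)
print(y.value_counts())
print(f'Positive class rate: {y.mean():.2%}')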

Building the model

From a machine learning perspective, the predictive task is a rare-event binary classification problem. Moreover, we need a probabilistic model to convey uncertainty effectively.

We'll use the LightGBM algorithm to build the model. Its parameters can be optimized using Optuna:

import optuna

import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# objective function for optuna
def objective(trial, X, y):

    train_x, valid_x, train_y, valid_y = \
        train_test_split(X, y, test_size=0.2, shuffle=False)

    dtrain = lgb.Dataset(train_x, label=train_y)

    param = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'linear_tree': True,
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }

    gbm = lgb.train(param, dtrain)
    preds = gbm.predict(valid_x)

    # optimizing for AUC
    auc = roc_auc_score(valid_y, preds)

    return auc

def optimize_params(X, y, n_trials: int):
    func = lambda trial: objective(trial, X, y)

    # auc should be maximized
    study = optuna.create_study(direction='maximize')
    study.optimize(func, n_trials=n_trials)

    # getting best parameter setup
    trial = study.best_trial

    return trial.params

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# optimization
params = optimize_params(X_train, y_train, n_trials=100)

# retraining with best parameters
dtrain = lgb.Dataset(X_train, label=y_train)
gbm = lgb.train(params, dtrain)

# inference
preds = gbm.predict(X_test)
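We can then score these predicted probabilities on the test set:

# evaluating the probabilistic predictions with ROC AUC
test_auc = roc_auc_score(y_test, preds)
print(f'Test AUC: {test_auc:.2f}')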

This leads to an AUC score of 0.78.

ROC curve of the model with an AUC of 0.78. Image by author.

The AUC score suggests that we can detect extreme weather events based on buoy data. There are a few things we could do to improve this model:

  • Better selection of buoys. I selected a few buoys close to the coast but far from each other, yet a different combination of buoys may work better;
  • Extra meteorological variables. We used 5 variables captured from each buoy, but others may provide value as well – for example, cloud cover information from satellites.

Key Takeaways

  • Extreme weather events pose a significant risk to human lives and infrastructure;
  • Climate change is associated with an increased number of extreme weather events. So, anticipating extreme weather events is important for our adaptation to climate change;
  • We can model the occurrence of an event based on meteorological data collected by smart buoys;
  • A probabilistic classifier based on LightGBM can detect extreme weather events with a decent AUC score;
  • A better feature engineering process may help improve this model.

Thank you for reading, and see you in the next story!


References

[1] Storm Events Database, data retrieved from https://www.ncdc.noaa.gov/stormevents/ftp.jsp (License: Public domain)

[2] National Data Buoy Center, data retrieved from https://www.ndbc.noaa.gov/ (License: Public domain)

[3] McGovern, Amy, et al. "Using Artificial Intelligence to improve real-time decision-making for high-impact weather." Bulletin of the American Meteorological Society 98.10 (2017): 2073–2090.

[4] Ramachandra, Vikas. "Weather event severity prediction using buoy data and machine learning." arXiv preprint arXiv:1911.09001 (2019).
