Time Series for Climate Change: Forecasting Extreme Weather Events
This is Part 5 of the series Time Series for Climate Change. List of articles:
- Part 1: Forecasting Wind Power
- Part 2: Solar Irradiance Forecasting
- Part 3: Forecasting Large Ocean Waves
- Part 4: Forecasting Energy Demand
Extreme Weather Events
Climate change has contributed to an increasing number of extreme weather events. These events pose a significant risk to human lives and infrastructure.

Extreme weather events can cause fatalities and cost billions of dollars. The financial impact comes from the destruction of infrastructure and agricultural resources. These events can also have a lasting impact on the socio-economic development of the affected region.

Extreme weather events cover several phenomena, including hurricanes, tornadoes, floods, droughts, hail, and tropical storms. The figure below shows the 20 most common event types in the USA since 1950.

The need for accurate forecasts
Accurate forecasts of extreme weather events play a key role in our adaptation to the impacts of climate change.
Issuing timely alarms enables people to evacuate or protect themselves and their assets. Response teams can mobilize resources more efficiently, reducing the impact of the events.
Yet, forecasting extreme weather events is a difficult task.
One problem is access to data. Extreme weather events result from the interaction of many factors in the atmosphere, in the ocean, and on land. These factors can be difficult to model, and the relevant data may not be readily available. Besides, extreme weather events are rare by definition. Machine learning models tend to struggle with imbalanced data sets, that is, data sets in which one of the classes is rare.
Another difficulty arises because weather conditions can change rapidly. This means there's a lot of uncertainty about the future. Forecasting models must convey uncertainty for effective communication. This aspect is important for policymakers and the public.
Tackling these challenges is a key step toward mitigating the impacts of climate change.
Hands-on: Forecasting Extreme Weather Events in Florida
In the rest of this article, we'll build a model to forecast hail and thunderstorm events in Florida, USA. Hail is a form of ice precipitation produced by thunderstorms.
The complete code used in this tutorial is available on GitHub:
Data set
We use a data set collected by NOAA that describes storm events that occurred in the USA since 1950 [1]. The data includes information such as:
- event type (e.g. hail, tornado);
- location (coordinates and respective state);
- date and time;
- estimated costs.
Here's a sample of the data about storm events in Florida:

Storm detection and tracking are usually done with remote sensing. But we can also model extreme weather events using meteorological data. We'll use meteorological data captured by smart buoys to predict impending hail or thunderstorm events. This data is also collected by NOAA [2].

Reading the data
We can read the buoy data directly from NOAA as follows:
import re
import urllib3
import numpy as np
import pandas as pd
# Subset of stations on the coast of Florida
STATION_LIST = ['41009', 'SPGF1', 'VENF1',
'42036', 'SAUF1', 'FWYF1',
'LONF1', 'SMKF1']
# URL template
URL = 'https://www.ndbc.noaa.gov/view_text_file.php?filename={station_id}h{year}.txt.gz&dir=data/historical/stdmet/'
def read_buoy_remote(station_id: str, year: int) -> pd.DataFrame:
    TIME_COLUMNS = ['YYYY', 'MM', 'DD', 'hh']

    # formatting the URL
    file_url = URL.format(station_id=station_id.lower(), year=year)

    http = urllib3.PoolManager()

    # get request
    response = http.request('GET', file_url)

    # decoding
    lines = response.data.decode().split('\n')

    # lots of data cleaning below
    data_list = []
    for line in lines:
        # collapsing runs of whitespace into a single space
        line = re.sub(r'\s+', ' ', line).strip()
        if line == '':
            continue
        line_data = line.split(' ')
        data_list.append(line_data)

    # row 0 holds the column names; row 1 is skipped
    # (in recent files it holds the units)
    df = pd.DataFrame(data_list[2:], columns=data_list[0]).astype(float)

    # 99.0 and 999.0 are treated as missing values
    df[(df == 99.0) | (df == 999.0)] = np.nan

    # harmonizing column names across file versions
    if 'BAR' in df.columns:
        df = df.rename({'BAR': 'PRES'}, axis=1)
    if '#YY' in df.columns:
        df = df.rename({'#YY': 'YYYY'}, axis=1)

    if 'mm' in df.columns:
        TIME_COLUMNS += ['mm']

    df[TIME_COLUMNS] = df[TIME_COLUMNS].astype(int)

    # building the timestamp from the date/time columns
    if 'mm' in df.columns:
        df['datetime'] = pd.to_datetime(
            [f'{y}/{m}/{d} {h}:{mi}'
             for y, m, d, h, mi in zip(df['YYYY'], df['MM'], df['DD'],
                                       df['hh'], df['mm'])])
    else:
        df['datetime'] = pd.to_datetime(
            [f'{y}/{m}/{d} {h}:00'
             for y, m, d, h in zip(df['YYYY'], df['MM'], df['DD'], df['hh'])])

    df = df.drop(TIME_COLUMNS, axis=1)
    df.set_index('datetime', inplace=True)

    return df
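To gather several stations and years into a single table, the reader can be called in a loop. The sketch below is one way to do this. `collect_buoys` is a hypothetical helper that is not part of the original code. It takes the reader function as an argument, so it can be tried offline with a stand-in (`fake_reader`, also made up for this example), and it adds the `STATION` column that the later steps rely on.

```python
from typing import Callable, Iterable

import pandas as pd


def collect_buoys(reader: Callable[[str, int], pd.DataFrame],
                  stations: Iterable[str],
                  years: Iterable[int]) -> pd.DataFrame:
    """Stack per-station, per-year frames and tag each row with its station."""
    years = list(years)
    frames = []
    for station_id in stations:
        for year in years:
            try:
                df = reader(station_id, year)
            except Exception as err:  # e.g. no file for that station and year
                print(f'skipping {station_id} {year}: {err}')
                continue
            df = df.copy()
            df['STATION'] = station_id
            frames.append(df)
    if not frames:
        raise ValueError('no data could be read')
    return pd.concat(frames).sort_index()


# Offline stand-in for read_buoy_remote, used only to demonstrate the helper
def fake_reader(station_id: str, year: int) -> pd.DataFrame:
    if year == 2021:
        raise IOError('file not found')
    index = pd.to_datetime([f'{year}-01-01 00:00', f'{year}-01-01 01:00'])
    index.name = 'datetime'
    return pd.DataFrame({'WSPD': [5.0, 6.0], 'PRES': [1015.0, 1014.2]},
                        index=index)


demo = collect_buoys(fake_reader, ['41009', 'SPGF1'], [2020, 2021])
print(demo)
```

With the real reader, the call would look like `collect_buoys(read_buoy_remote, STATION_LIST, years)` for whichever years are needed, and the result can be written to disk with `to_csv`.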
We collected the following variables from 8 buoys:
- Wind speed;
- Wave height;
- Atmospheric pressure;
- Water temperature;
- Average wave period.
Here's a sample of this data from one of the buoys:

Building the data set
We need to merge the two data sources to create a data set for training a forecasting model.
At each time step, we get the recent past values of meteorological data from the buoys. These are used as explanatory variables. The target variable is a binary variable indicating whether a hail or thunderstorm event occurs in the next 12 hours.
import numpy as np
import pandas as pd
from config import ASSETS, OUTPUTS
PART = 'Part 5'
assets = ASSETS[PART]
# focusing on hail and thunderstorm events
TARGET_EVENTS = ['Hail', 'Thunderstorm Wind']
# wind speed, wave height, pressure, water temp, avg wave period
METEOROLOGICAL_DATA = ['WSPD', 'WVHT', 'PRES', 'WTMP', 'APD']
# using past 4 hours as explanatory variables
N_LAGS = 4
# forecasting events in the next 12 hours
HORIZON = 12
# loading storm events data
storms = pd.read_csv(f'{assets}/storms_data.csv', index_col='storm_start')
storms.index = pd.to_datetime(storms.index)
hail_df = storms.loc[storms['EVENT_TYPE'].isin(TARGET_EVENTS), :]
# loading the meteorological data
buoys = pd.read_csv(f'{assets}/buoys.csv', index_col='datetime')
buoys.index = pd.to_datetime(buoys.index).tz_localize('UTC')
buoys['STATION'] = buoys['STATION'].astype(str)
# resampling the data to an hourly granularity
buoys_h = buoys.groupby('STATION').resample('H').mean()
# subsetting the variables
buoys_h = buoys_h[METEOROLOGICAL_DATA]
buoys_df = buoys_h.reset_index('STATION')
# getting all unique time steps
base_index = buoys_df.index.unique()
# getting list of stations
station_list = buoys_df['STATION'].unique().tolist()
X, y = [], []
# iterating over each time step
for i, dt in enumerate(base_index[N_LAGS + 1:]):

    features_by_station = []
    # iterating over each buoy station
    for station_id in station_list:
        # subsetting the data by station and time step (last n_lags observations)
        station_df = buoys_df.loc[buoys_df['STATION'] == station_id]
        station_df = station_df.drop('STATION', axis=1)
        station_df_i = station_df[:dt].tail(N_LAGS)

        if station_df_i.shape[0] < N_LAGS:
            break

        # transforming lags into features
        station_timestep_values = []
        for col in station_df_i:
            series = station_df_i[col].copy()
            series.index = [f'{station_id}({series.name})-{lag}'
                            for lag in range(N_LAGS, 0, -1)]
            station_timestep_values.append(series)

        station_values = pd.concat(station_timestep_values, axis=0)
        features_by_station.append(station_values)

    if len(features_by_station) < 1:
        continue

    # combining features from all stations
    feature_set_i = pd.concat(features_by_station, axis=0)
    X.append(feature_set_i)

    # determining the target variable:
    # whether an extreme weather event occurs in the next HORIZON hours
    td = (hail_df.index - dt)
    td_hours = td / np.timedelta64(1, 'h')
    any_event_within = pd.Series(td_hours).between(0, HORIZON)
    y.append(any_event_within.any())
# combining all data points
X = pd.concat(X, axis=1).T
y = pd.Series(y).astype(int)
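To make the construction easier to follow, here is a small, self-contained sketch of the same two ideas (lagged values as features, and a binary target for events within the horizon) on made-up data for a single variable. The helper names `make_lag_features` and `label_events`, the toy wind speeds, and the event time are all assumptions for illustration. As in the loop above, the lag labeled `-1` is the observation at the current time step.

```python
import numpy as np
import pandas as pd

N_LAGS = 2    # toy value; the tutorial uses 4
HORIZON = 2   # toy value; the tutorial uses 12


def make_lag_features(series: pd.Series, n_lags: int) -> pd.DataFrame:
    """Row t holds the n_lags most recent values up to and including t."""
    lags = {f'{series.name}-{k}': series.shift(k - 1)
            for k in range(n_lags, 0, -1)}
    # rows without a full history are dropped
    return pd.DataFrame(lags).dropna()


def label_events(index: pd.DatetimeIndex,
                 event_times: pd.DatetimeIndex,
                 horizon: int) -> pd.Series:
    """1 if any event starts within [t, t + horizon hours], else 0."""
    labels = []
    for dt in index:
        hours_ahead = (event_times - dt) / np.timedelta64(1, 'h')
        within = (hours_ahead >= 0) & (hours_ahead <= horizon)
        labels.append(int(within.any()))
    return pd.Series(labels, index=index)


# hypothetical hourly wind speed and a single storm event at 04:30
index = pd.date_range('2020-01-01 00:00', periods=6,
                      freq=pd.Timedelta(hours=1))
wspd = pd.Series([10., 11., 12., 13., 14., 15.], index=index, name='WSPD')
events = pd.DatetimeIndex(['2020-01-01 04:30'])

X_toy = make_lag_features(wspd, N_LAGS)
y_toy = label_events(X_toy.index, events, HORIZON)

print(X_toy.assign(target=y_toy))
```

Only the rows at 03:00 and 04:00 get a positive label, since the event at 04:30 falls within two hours of those time steps.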
Building the model
From a machine learning perspective, the predictive task is a rare event binary classification problem. Besides, we need a probabilistic model to convey uncertainty effectively.
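The rarity of events is also why plain accuracy is a poor guide here. The sketch below uses made-up labels (3 events in 100 time steps, an assumption for illustration only) to show that a model which never raises an alarm scores high on accuracy while detecting nothing. This is one reason to evaluate with a ranking metric such as AUC instead.

```python
def event_rate(labels):
    """Fraction of time steps with an event."""
    return sum(labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of time steps predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual events that were predicted."""
    events = sum(y_true)
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return caught / events


# hypothetical labels: 3 events in 100 time steps
y_true = [0] * 97 + [1] * 3
# a "model" that always predicts no event
always_no = [0] * 100

rate = event_rate(y_true)
acc = accuracy(y_true, always_no)
rec = recall(y_true, always_no)

print(f'event rate: {rate:.2f}, accuracy: {acc:.2f}, recall: {rec:.2f}')
```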
We'll use LightGBM to build the model. Its parameters can be optimized using Optuna:
import optuna
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
# objective function for optuna
def objective(trial, X, y):
    # time-ordered split: no shuffling, the last 20% is used for validation
    train_x, valid_x, train_y, valid_y = train_test_split(
        X, y, test_size=0.2, shuffle=False)

    dtrain = lgb.Dataset(train_x, label=train_y)

    param = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'linear_tree': True,
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }

    gbm = lgb.train(param, dtrain)
    preds = gbm.predict(valid_x)

    # optimizing for AUC
    auc = roc_auc_score(valid_y, preds)

    return auc
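def optimize_params(X, y, n_trials: int):
    func = lambda trial: objective(trial, X, y)

    # auc should be maximized
    study = optuna.create_study(direction='maximize')
    study.optimize(func, n_trials=n_trials)

    # getting best parameter setup
    trial = study.best_trial
    return trial.params

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
# optimization
best_params = optimize_params(X_train, y_train, n_trials=100)
# trial.params only contains the tuned values, so the fixed settings
# from the objective function are added back before retraining
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'verbosity': -1,
    'boosting_type': 'gbdt',
    'linear_tree': True,
    **best_params,
}
# retraining with best parameters
dtrain = lgb.Dataset(X_train, label=y_train)
gbm = lgb.train(params, dtrain)
# inference: predicted probability of an event in the next HORIZON hours
preds = gbm.predict(X_test)
# AUC on the test set
auc = roc_auc_score(y_test, preds)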
This leads to an AUC score of 0.78.
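As a reminder of what this number means: the AUC is the probability that a randomly chosen time step with an event receives a higher predicted probability than a randomly chosen time step without one (ties count as one half). The sketch below computes it directly from that definition. The function name `auc_by_pairs` and the toy labels and scores are made up for illustration, and in practice `roc_auc_score` is the more efficient way to get the same quantity.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (event, non-event) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one event and one non-event')

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # event ranked above non-event
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# hypothetical labels and predicted probabilities
y_demo = [0, 0, 1, 0, 1, 0]
p_demo = [0.1, 0.4, 0.35, 0.2, 0.8, 0.05]

auc_demo = auc_by_pairs(y_demo, p_demo)
print(auc_demo)
```

In this toy case, 7 of the 8 event/non-event pairs are ordered correctly. A score of 0.5 corresponds to random ranking and 1.0 to a perfect one.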

The AUC score suggests that we can detect extreme weather events based on buoy data. There are a few things we could do to improve this model:
- Better selection of buoys. I selected a few buoys close to the coast but far from each other. Yet, a different combination of buoys may be better;
- Extra meteorological variables. We used 5 variables captured from each buoy, but others may provide value as well. For example, cloud cover information from satellites.
Key Takeaways
- Extreme weather events pose a significant risk to human lives and infrastructure;
- Climate change is associated with an increased number of extreme weather events. So, anticipating extreme weather events is important for our adaptation to climate change;
- We can model the occurrence of an event based on meteorological data collected by smart buoys;
- A probabilistic classifier based on LightGBM can detect extreme weather events with a decent AUC score;
- A better feature engineering process may help improve this model.
Thank you for reading, and see you in the next story!
References
[1] Storm Events Database, data retrieved from https://www.ncdc.noaa.gov/stormevents/ftp.jsp (License: Public domain)
[2] National Data Buoy Center, data retrieved from https://www.ndbc.noaa.gov/ (License: Public domain)
[3] McGovern, Amy, et al. "Using Artificial Intelligence to improve real-time decision-making for high-impact weather." Bulletin of the American Meteorological Society 98.10 (2017): 2073–2090.
[4] Ramachandra, Vikas. "Weather event severity prediction using buoy data and machine learning." arXiv preprint arXiv:1911.09001 (2019).