Data Augmentation in AI for Science: An Earth Science Case Study

Background
In the era of big data, AI accelerates scientific research, with promising applications in forecasting, emulation, and knowledge mining. AI for Science (AI4Science) is emerging as a new scientific paradigm. One of the main obstacles to applying AI is limited data, especially observational ground-truth data. Fortunately, these data can be augmented with another data source available in scientific communities: the simulated data generated by physical models.
Why Augment Data with Physical Simulations?
Data augmentation is a powerful technique to improve the performance of AI models. Here are three key reasons for incorporating simulated data:

- More data: Observational data are often limited because data collection is expensive. Simulated data provide a larger dataset at a lower cost.
- Generalization: Observational data are unevenly distributed across space and time. Simulated data, on the other hand, have seamless spatial-temporal coverage, helping models to generalize better.
- Robustness: Observational data suffer from errors caused by instrument failures. Simulated data constrained by physical laws tend to have fewer outliers, improving the model's robustness.
How to Implement Data Augmentation in AI4Science
Incorporating simulated data into AI models does not necessarily improve the accuracy of predictions on observational data. It requires techniques for extracting additional information from the simulated data while reducing the noise it introduces. Using streamflow prediction in Earth science as an example, this article provides a tutorial illustrating key techniques for integrating data augmentation into AI4Science model development:
- Adversarial validation: Filtering biased simulated samples
- Sample weighting: Avoiding overfitting simulated data
- Enhanced cross-validation: Separating the roles of observational and simulated data
Case Study: Streamflow Prediction in Hydrology
Getting Started
In hydrology, predicting the streamflow of a watershed is a complex task due to the interaction of various meteorological variables such as precipitation, soil moisture, and evaporation. Physical models use dynamic meteorological variables to make continuous simulations of streamflow at each time step for each watershed. For AI models, we can pool data from multiple watersheds and use both dynamic meteorological variables and static watershed attributes as inputs. As depicted in Figure 2, we formulate streamflow prediction as a supervised tabular regression problem: we engineer features using domain knowledge and use them to train a deep learning model. Here we leverage both observational and simulated data to train the model.
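To make this setup concrete, here is a minimal sketch of how such a table might look. The feature columns area_km2, soil_moisture, and evaporation are illustrative assumptions, while rain, runoff, and the runoff_type flag match the fields used in the code later in this article.
import pandas as pd

# Sketch of the tabular layout assumed in this article (column names are illustrative):
# each row is one watershed at one time step, combining dynamic meteorological features,
# static watershed attributes, the target (runoff), and a flag marking observed vs. simulated rows.
sample_df = pd.DataFrame({
    "watershed_id": ["W01", "W01", "W02", "W02"],  # identifier for pooling multiple watersheds
    "area_km2": [320.0, 320.0, 85.5, 85.5],        # static watershed attribute
    "rain": [12.4, 0.0, 33.1, 5.2],                # dynamic meteorological feature
    "soil_moisture": [0.27, 0.22, 0.35, 0.30],     # dynamic meteorological feature
    "evaporation": [1.1, 1.4, 0.8, 0.9],           # dynamic meteorological feature
    "runoff": [4.8, 0.6, 19.7, 2.3],               # target: streamflow
    "runoff_type": ["obs", "obs", "sim", "sim"],   # data source flag used throughout this article
})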

Adversarial Validation: Filtering Biased Simulated Samples
Not all simulated data are suitable as augmented samples. Physical models with no or limited calibration may consistently generate target variables with systematic bias. In our case, the physical model underestimates streamflow caused by extremely high rainfall, as Figure 3(a) shows. To filter simulated samples that have a severe bias in the relationship between rainfall and streamflow, we can train a classifier to separate observations and simulations. Samples with a high probability of being simulated data should be removed since they fall the farthest from the distribution of the observational data.

This technique is known as adversarial validation, which is frequently used to detect distribution shift between training and testing data. In practice, we train a simple logistic regression model and set the classification threshold to a low value, e.g., 0.1, which filters out simulated samples that are highly biased.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def remove_adversarial(df: pd.DataFrame) -> pd.DataFrame:
    '''Removing simulated samples with distribution shift
    :param df: all data
    :return: data with heavily biased simulated samples removed
    '''
    # Train a classifier to distinguish observed ("obs") from simulated ("sim") samples
    # based on the rainfall-runoff relationship
    X = df[["rain", "runoff"]]
    y = df["runoff_type"]
    clf = LogisticRegression(random_state=42)
    clf.fit(X, y)
    # Probability of each sample belonging to the "obs" class
    obs_col = list(clf.classes_).index("obs")
    result = pd.DataFrame({"rain": X["rain"], "runoff": X["runoff"], "y": y,
                           "y_prob": clf.predict_proba(X)[:, obs_col]})
    # A low threshold flags only samples that look strongly simulated
    result["y_pred"] = "obs"
    threshold = 0.1
    result.loc[result["y_prob"] < threshold, "y_pred"] = "sim"
    # Drop simulated samples that the classifier confidently separates from the observations
    remove_indexes = result[(result["y"] == "sim") & (result["y_pred"] == "sim")].index
    df = df.drop(remove_indexes).reset_index(drop=True)
    return df
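A hypothetical usage sketch, assuming sample_df is a combined table of observational and simulated rows as described above:
# Drop simulated rows whose rain-runoff relationship deviates strongly from the observations
filtered_df = remove_adversarial(sample_df)
print(f"Removed {len(sample_df) - len(filtered_df)} heavily biased simulated samples")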
Sample Weighting: Avoiding Overfitting Simulated Data
Simulated data may bring noise into the training samples, which derails us from the ultimate goal: high prediction accuracy on observational data. To reduce potential noise, more importance should be placed on observational data. Sample weighting is a prevailing technique for adjusting the importance of training samples. Using WeightedRandomSampler from the PyTorch library, we assign weights of 2/3 and 1/3 to observational samples ("obs") and simulated samples ("sim"), respectively, so that observational samples are fed to the model twice as often as simulated samples during training.
from torch.utils.data import WeightedRandomSampler
import numpy as np
import pandas as pd

def weighting(train: pd.DataFrame) -> WeightedRandomSampler:
    '''Weighting observational & simulated samples
    :param train: training data
    :return: a WeightedRandomSampler to be passed to model fitting
    '''
    # Each observational sample gets twice the weight of a simulated sample
    weight_obs = 2 / 3
    weights = np.array([weight_obs if sample_type == "obs" else (1 - weight_obs)
                        for sample_type in train["runoff_type"]])
    # Normalize to a probability distribution (the sampler only needs relative weights)
    weights = weights / weights.sum()
    train_sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
    return train_sampler
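As a quick sanity check (a toy example, not part of the original workflow), we can draw one epoch of indices from the sampler and confirm that observational rows are sampled roughly twice as often as simulated rows:
import pandas as pd

# Toy training table with equal numbers of observed and simulated rows
toy_train = pd.DataFrame({"runoff_type": ["obs"] * 100 + ["sim"] * 100})
sampler = weighting(toy_train)
drawn = list(sampler)  # one epoch of sampled row indices
n_obs = sum(toy_train.loc[i, "runoff_type"] == "obs" for i in drawn)
print(f"obs draws: {n_obs}, sim draws: {len(drawn) - n_obs}")  # roughly 2:1 on average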
Enhanced Cross-Validation: Separating the Roles of Observational and Simulated Data
For model development, we use the PyTorch Tabular library, a tool for deep learning on tabular data built on PyTorch and PyTorch Lightning. We define _get_tabular_model and _update_hyperparams methods to encapsulate the necessary PyTorch Tabular APIs for preparing a model; please refer to PyTorch Tabular's documentation for details.
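Since _get_tabular_model is left elided here, below is a minimal standalone sketch of what it could look like with PyTorch Tabular's configuration API. The feature columns, trainer settings, and the fixed FTTransformer backbone are assumptions for illustration, not the exact setup used in this article.
from pytorch_tabular import TabularModel
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig
from pytorch_tabular.models import FTTransformerConfig

def get_tabular_model_sketch(sample_df, model_name: str, model_params: dict) -> TabularModel:
    '''Hypothetical standalone version of _get_tabular_model (column names and settings are assumed)'''
    data_config = DataConfig(
        target=["runoff"],  # target column used throughout this article
        continuous_cols=["rain", "soil_moisture", "evaporation", "area_km2"],  # assumed feature columns
    )
    trainer_config = TrainerConfig(max_epochs=100, batch_size=1024, early_stopping="valid_loss")
    optimizer_config = OptimizerConfig()
    # model_name and model_params handling is omitted; the backbone is fixed to FTTransformer here
    model_config = FTTransformerConfig(task="regression")
    return TabularModel(
        data_config=data_config,
        model_config=model_config,
        optimizer_config=optimizer_config,
        trainer_config=trainer_config,
    )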
An enhanced cross-validation method trains the model with both observational and simulated data but validates it only with observational data. In this way, the model leverages information from the simulated data while the evaluation pivots on the observational data. In the code below, we use KFold to split only the observational data (original_df) and train_test_split to further split the training folds into a training set (train) and a validation set (valid) for early stopping. All simulated data (synthetic_df) are added only to the training set (train) as data augmentation, with the weighted sampler train_sampler.
from pytorch_tabular import TabularModel
import pandas as pd
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error

class TorchTabularAugModel:
    def _get_tabular_model(self, sample_df: pd.DataFrame, model_name: str, model_params: dict):
        '''Creating a TabularModel'''
        ...

    def _update_hyperparams(self, tabular_model, new_params):
        '''Updating hyper-parameters of an existing TabularModel'''
        ...

    def enhanced_cross_validate(self, sample_df: pd.DataFrame, model_name: str, hyper_params={}, new_setting={}, k=5):
        '''Enhanced cross-validation with separated observational & simulated data
        :param sample_df: all data
        :param model_name: model name
        :param hyper_params: hyper-parameters
        :param new_setting: training settings, i.e., lr scheduler & target transformation
        :param k: number of folds
        :return: cross-validation scores & out-of-fold predictions
        '''
        original_df = sample_df[sample_df['runoff_type'] == 'obs'].reset_index(drop=True)
        synthetic_df = sample_df[sample_df['runoff_type'] == 'sim'].reset_index(drop=True)
        kf = KFold(n_splits=k, shuffle=True, random_state=42)
        cv_scores = []
        oof_predictions = []
        # Split only the observational data into folds; simulated data never enter the test folds
        for fold, (train_idx, val_idx) in enumerate(kf.split(original_df)):
            train = original_df.iloc[train_idx].reset_index(drop=True)
            test = original_df.iloc[val_idx]
            # Hold out a validation set (observational only) for early stopping
            train, valid = train_test_split(train, test_size=0.3, random_state=42)
            train, valid = train.reset_index(drop=True), valid.reset_index(drop=True)
            # Add simulated data to the training set only
            train = pd.concat([train, synthetic_df]).reset_index(drop=True)
            # Sampler that assigns different weights to obs & sim data
            train_sampler = weighting(train)
            tabular_model = self._get_tabular_model(train, model_name, {})
            self._update_hyperparams(tabular_model, hyper_params)
            tabular_model.fit(
                train=train,
                validation=valid,
                train_sampler=train_sampler
            )
            # PyTorch Tabular returns the test frame with an added "<target>_prediction" column
            y_pred = tabular_model.predict(test)["runoff_prediction"]
            score = mean_squared_error(test["runoff"], y_pred)
            cv_scores.append(score)
            oof_predictions.append(y_pred)
        return cv_scores, pd.concat(oof_predictions).sort_index()
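A hypothetical call, assuming sample_df is the combined table built earlier and the elided helper methods are implemented:
model = TorchTabularAugModel()
cv_scores, oof_predictions = model.enhanced_cross_validate(
    sample_df, model_name="FTTransformerConfig", hyper_params={}, new_setting={}, k=5)
print(f"Mean out-of-fold MSE on observational data: {sum(cv_scores) / len(cv_scores):.4f}")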
The enhanced cross-validation is used for hyper-parameter tuning. The code below presents the case of tuning an FTTransformer model. The best hyper-parameters are determined by a random search over search_space combined with the enhanced_cross_validate method.
import random
from copy import deepcopy
import numpy as np
import pandas as pd
from scipy.stats import uniform

class TorchTabularAugModel:
    ...

    def tuning(self, sample_df: pd.DataFrame) -> pd.DataFrame:
        '''Tuning hyper-parameters
        :param sample_df: all data
        :return: a table of tuning results (hyper-parameter sets and scores)
        '''
        model_name = "FTTransformerConfig"
        # Search space: lists are sampled uniformly at random, distributions are sampled via .rvs()
        search_space = {
            "model_config__num_heads": [2, 4, 8],
            "model_config__num_attn_blocks": [2, 3, 4, 6],
            "model_config__attn_dropout": uniform(0, 0.4),
            "model_config__ff_dropout": uniform(0, 0.4),
            "model_config.head_config__dropout": uniform(0, 0.4),
        }
        n_trials = 50

        def param_sampler(search_space):
            '''Draw one random hyper-parameter set from the search space'''
            sample = {}
            for key, value in search_space.items():
                if isinstance(value, list):
                    sample[key] = random.choice(value)
                else:
                    sample[key] = value.rvs()
                if type(sample[key]) != int:
                    sample[key] = float(sample[key])
            return sample

        param_sets = [param_sampler(search_space) for _ in range(n_trials)]
        results = []
        for i, hyper_params in enumerate(param_sets):
            print(hyper_params)
            cv_scores, oof_predictions = self.enhanced_cross_validate(
                sample_df, model_name=model_name, hyper_params=hyper_params, new_setting={}, k=5)
            result_item = deepcopy(hyper_params)
            # Record the mean MSE across folds for ranking the trials
            result_item["cv_score"] = float(np.mean(cv_scores))
            results.append(result_item)
        return pd.DataFrame(results)
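Once the tuning table is available, the best configuration can be read off directly; a small sketch, assuming lower mean MSE is better:
model = TorchTabularAugModel()
results_df = model.tuning(sample_df)
best = results_df.sort_values("cv_score").iloc[0]  # trial with the lowest mean MSE across folds
print("Best hyper-parameters:")
print(best.drop("cv_score"))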
Summary
Augmenting observational data with simulated data from physical models improves the accuracy, generalization, and robustness of AI models. This article introduced effective techniques such as adversarial validation, sample weighting, and enhanced cross-validation, using a case study in Earth science. These methods represent just the tip of the iceberg in the realm of data augmentation, but they reveal possibilities for innovative applications across scientific disciplines.