Mixed Effects Machine Learning with GPBoost for Grouped and Areal Spatial Econometric Data
The GPBoost algorithm extends linear mixed effects and Gaussian process models by replacing the linear fixed effects function with a non-parametric, non-linear function modeled using tree-boosting. This article shows how the GPBoost algorithm, as implemented in the GPBoost library, can be used for modeling data with a spatial and grouped structure. We demonstrate the functionality of the GPBoost library using European GDP data, an example of areal spatial econometric data.
Table of contents
- Introduction
  - Basic workflow of GPBoost
  - Data description
  - Data loading and short visualization
- Training a GPBoost model
- Choosing tuning parameters
- Model interpretation
  - Estimated random effects model
  - Spatial effect map
  - Understanding the fixed effects function
- Extensions
  - Separate random effects for different time periods
  - Interaction between space and fixed effects predictor variables
  - Large data
  - Other spatial random effects models
  - (Generalized) linear mixed effects and Gaussian process models
- References
Introduction
Basic workflow of GPBoost
Applying a GPBoost model (= combined tree-boosting and random effects / GP models) involves the following main steps:
- Define a GPModel in which you specify the following:
  - A random effects model (e.g., spatial random effects, grouped random effects, combined spatial and grouped random effects)
  - The likelihood (= distribution of the response variable conditional on fixed and random effects)
- Create a gpb.Dataset with the response variable and the fixed effects predictor variables
- Choose tuning parameters, e.g., using the function gpb.grid.search.tune.parameters
- Train the model with the gpboost / gpb.train function
- Interpret the trained model and/or make predictions
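A minimal end-to-end sketch of these steps on simulated placeholder data (not the GDP data used below; all parameter values are for illustration only) looks as follows:
library(gpboost)
set.seed(1)
# Simulated placeholder data: 500 observations, 10 groups, 2 predictor variables
n <- 500
group <- sample(1:10, n, replace = TRUE)
X <- matrix(rnorm(2 * n), ncol = 2)
y <- X[, 1] + 0.5 * group + rnorm(n, sd = 0.1)
# 1. Define a GPModel: grouped random effects with a Gaussian likelihood
gp_model <- GPModel(group_data = group, likelihood = "gaussian")
# 2. Create a gpb.Dataset with the response and fixed effects predictors
boost_data <- gpb.Dataset(data = X, label = y)
# 3./4. Train; these tuning parameters are placeholders, not tuned values
bst <- gpboost(data = boost_data, gp_model = gp_model, nrounds = 100,
               params = list(learning_rate = 0.1, max_depth = 3), verbose = 0)
# 5. Make predictions (here simply at the training inputs)
pred <- predict(bst, data = X, group_data_pred = group)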
In this article, we demonstrate these steps using a real-world data set. We also show how to interpret fitted models, and we look at several extensions and additional features of the GPBoost library. All results were obtained with GPBoost version 1.2.1. This demo uses the R package, but the corresponding Python package provides the same functionality.
Data description
The data used in this demo is European gross domestic product (GDP) data. It can be downloaded from https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv. The data was collected by Massimo Giannini, University of Rome Tor Vergata, from Eurostat and kindly provided to Fabio Sigrist for a talk at the University of Rome Tor Vergata on June 16, 2023.
Data was collected for 242 European regions for the two years 2000 and 2021, i.e., there are 484 data points in total. The response variable is (log) GDP / capita. There are four predictor variables:
- L: (log) share of employment (empl/pop)
- K: (log) fixed capital/population
- pop: log(population)
- edu: share of tertiary education
Further, there are centroid spatial coordinates of every region (longitude and latitude), a spatial region ID for the 242 different European regions, and an additional spatial cluster ID which identifies the cluster a region belongs to (there are two clusters).
Data loading and short visualization
We first load the data and create a map illustrating the (log) GDP / capita over space. We create two maps: one with all data and one excluding some remote islands. In the commented code below, we also show how to create a spatial plot when no shape file with spatial polygons is available.
library(gpboost)
library(ggplot2)
library(gridExtra)
library(viridis)
library(sf)
# Load data
data <- read.csv("https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv")
FID <- data$FID
data <- as.matrix(data[,names(data)!="FID"]) # convert to matrix since the boosting part currently does not support data.frames
covars <- c("L", "K", "pop", "edu")
# Load shape file for spatial plots
cur_tempfile <- tempfile()
download.file(url = "https://raw.githubusercontent.com/fabsig/GPBoost/master/data/shape_European_regions.zip", destfile = cur_tempfile)
out_directory <- tempfile()
unzip(cur_tempfile, exdir = out_directory)
shape <- st_read(dsn = out_directory)
# Create spatial plot of GDP
data_plot <- merge(shape, data.frame(FID = FID, y = data[,"y"]), by="FID")
p1 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="GDP / capita (log)", option = "B") +
ggtitle("GDP / capita (log)")
# Same plot excluding islands
p2 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="GDP / capita (log)", option = "B") +
xlim(2700000,6400000) + ylim(1500000,5200000) +
ggtitle("GDP / capita (log) -- excluding islands")
grid.arrange(p1, p2, ncol=2)
# # Spatial plot without a shape file
# p1 <- ggplot(data = data.frame(Lat=data[,"Lat"], Long=data[,"Long"],
# GDP=data[,"y"]), aes(x=Long,y=Lat,color=GDP)) +
# geom_point(size=2, alpha=0.5) + scale_color_viridis(option = "B") +
# ggtitle("GDP / capita (log)")
# p2 <- ggplot(data = data.frame(Lat=data[,"Lat"], Long=data[,"Long"],
# GDP=data[,"y"]), aes(x=Long,y=Lat,color=GDP)) +
# geom_point(size=3, alpha=0.5) + scale_color_viridis(option = "B") +
# ggtitle("GDP / capita (log) -- Europe excluding islands") + xlim(-10,28) + ylim(35,67)
# grid.arrange(p1, p2, ncol=2)

Training a GPBoost model
In the following, we use a Gaussian process model with an exponential covariance function to model spatial random effects. Additionally, we include grouped random effects for the cluster variable cl. In the GPBoost library, Gaussian process random effects are defined via the gp_coords argument and grouped random effects via the group_data argument of the GPModel constructor. The above-mentioned predictor variables are used in the fixed effects tree-ensemble function. We fit the model using the gpboost function (or, equivalently, gpb.train). Note that we use tuning parameters that are selected below using cross-validation.
gp_model <- GPModel(group_data = data[, c("cl")],
gp_coords = data[, c("Long", "Lat")],
likelihood = "gaussian", cov_function = "exponential")
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
min_data_in_leaf = 10, lambda_l2 = 0)
nrounds <- 37
# gp_model$set_optim_params(params = list(trace=TRUE)) # To monitor hyperparameter estimation
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model,
nrounds = nrounds, params = params,
verbose = 1) # gpboost is the same as gpb.train
Choosing tuning parameters
It is important that tuning parameters are appropriately chosen for boosting. There are no universal default values, and every data set will likely need different tuning parameters. Below we show how tuning parameters can be chosen using the gpb.grid.search.tune.parameters function. We use the mean squared error (mse) as the prediction accuracy measure on the validation data. Alternatively, one can also use, e.g., the test negative log-likelihood (test_neg_log_likelihood, the default if nothing is specified), which also takes prediction uncertainty into account. Depending on the data set and the grid size, this can take some time. Instead of a deterministic grid search as below, one can also do a random grid search to speed things up (see num_try_random).
gp_model <- GPModel(group_data = data[, "cl"],
gp_coords = data[, c("Long", "Lat")],
likelihood = "gaussian", cov_function = "exponential")
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
param_grid = list("learning_rate" = c(1,0.1,0.01),
"min_data_in_leaf" = c(10,100,1000),
"max_depth" = c(1,2,3,5,10),
"lambda_l2" = c(0,1,10))
other_params <- list(num_leaves = 2^10)
set.seed(1)
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
num_try_random = NULL, nfold = 4,
data = boost_data, gp_model = gp_model,
nrounds = 1000, early_stopping_rounds = 10,
verbose_eval = 1, metric = "mse") # metric = "test_neg_log_likelihood"
opt_params
# ***** New best test score (l2 = 0.0255393919591794) found for the following parameter combination: learning_rate: 0.01, min_data_in_leaf: 10, max_depth: 2, lambda_l2: 0, nrounds: 37
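As a faster, approximate alternative to the full grid search above, one can try a fixed number of randomly chosen parameter combinations via num_try_random; a sketch using the same grid and data as above:
# Random grid search: try only 10 randomly chosen parameter combinations
opt_params_rand <- gpb.grid.search.tune.parameters(param_grid = param_grid,
                       params = other_params, num_try_random = 10, nfold = 4,
                       data = boost_data, gp_model = gp_model,
                       nrounds = 1000, early_stopping_rounds = 10,
                       verbose_eval = 1, metric = "mse")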
Model interpretation
Estimated random effects model
Information on the estimated random effects model can be obtained by calling the summary function on the gp_model. For Gaussian processes, GP_var is the marginal variance, i.e., the amount of spatial correlation or structured spatial variation, and GP_range is the range, or scale, parameter that measures how fast the correlation decays over space. For an exponential covariance function, three times this value (approx. 17 here) is the distance at which the (residual) spatial correlation is essentially zero (below 5%). As the results below show, the amount of spatial correlation is relatively small: the marginal variance of 0.0089 is small compared to the total variance of the response variable, which is approx. 0.29. The error term and the cl grouped random effects also have small variances (0.013 and 0.022). We thus conclude that most of the variance in the response variable is explained by the fixed effects predictor variables.
summary(gp_model)
## =====================================================
## Covariance parameters (random effects):
## Param.
## Error_term 0.0131
## Group_1 0.0221
## GP_var 0.0089
## GP_range 5.7502
## =====================================================
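As a quick sanity check of these statements, one can relate the estimated variances from the summary above to the total variance of the response and compute the effective range of the exponential covariance (a rough, illustrative calculation using the numbers shown above):
# Share of the total response variance per random component
var_y <- var(data[, "y"])  # approx. 0.29
round(c(Error_term = 0.0131, Group_1 = 0.0221, GP_var = 0.0089) / var_y, 3)
# Effective range: the correlation exp(-d / range) drops below 5% at
# d = 3 * range, since exp(-3) is approx. 0.0498
3 * 5.7502  # approx. 17.25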
Spatial effect map
We can plot the estimated ("predicted") spatial random effects at the training data locations by calling the predict function on the training data; see the code below. Such a plot shows the spatial effect after factoring out the effect of the fixed effects predictor variables. Note that these spatial effects combine the spatial Gaussian process and the grouped region-cluster random effects. If one wants only the spatial random effects from the Gaussian process part, one can use the predict_training_data_random_effects function (see the commented code below). Alternatively, one can also do spatial extrapolation (= "Kriging"), but this makes little sense for areal data.
pred <- predict(gpboost_model, group_data_pred = data[1:242, c("cl")],
gp_coords_pred = data[1:242, c("Long", "Lat")],
data = data[1:242, covars], predict_var = TRUE, pred_latent = TRUE)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean), by="FID")
plot_mu <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="Spatial effect", option = "B") +
ggtitle("Spatial effect (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
data_plot <- merge(shape, data.frame(FID = FID, y = sqrt(pred$random_effect_cov)), by="FID") # std. dev. = sqrt of predictive variance
plot_sd <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="Std. dev.", option = "B") +
ggtitle("Uncertainty (std. dev.)") + xlim(2700000,6400000) + ylim(1500000,5200000)
grid.arrange(plot_mu, plot_sd, ncol=2)
# Only spatial effects from the Gaussian process ignoring the grouped random effects
# rand_effs <- predict_training_data_random_effects(gp_model, predict_var = TRUE)[1:242, c("GP", "GP_var")]
# data_plot <- merge(shape, data.frame(FID = FID, y = rand_effs[,1]), by="FID")
# plot_mu <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
# scale_fill_viridis(name="Spatial effect (mean)", option = "B") +
# ggtitle("Spatial effect from Gausian process (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
# data_plot <- merge(shape, data.frame(FID = FID, y = rand_effs[,2]), by="FID")
# plot_sd <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
# scale_fill_viridis(name="Uncertainty (std. dev.)", option = "B") +
# ggtitle("Uncertainty (std. dev.)") + xlim(2700000,6400000) + ylim(1500000,5200000)
# grid.arrange(plot_mu, plot_sd, ncol=2)
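For illustration, a sketch of such spatial extrapolation to new locations follows below; as noted above, this is of limited use for areal data. The coordinates are made-up values in the data's coordinate reference system, and the cluster label -1 denotes an unobserved group (which is predicted with zero mean):
# Hypothetical new locations (made-up coordinates, for illustration only)
coords_new <- cbind(Long = c(4000000, 5000000), Lat = c(2500000, 3000000))
X_new <- data[1:2, covars]  # placeholder fixed effects inputs
pred_new <- predict(gpboost_model, gp_coords_pred = coords_new,
                    group_data_pred = rep(-1, 2), data = X_new,
                    predict_var = TRUE, pred_latent = TRUE)
pred_new$random_effect_mean  # predicted latent spatial effects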

Understanding the fixed effects function
There are several tools for understanding the form of the fixed effects function. Below we consider variable importance measures, interaction measures, and dependence plots. Specifically, we look at
- SHAP values
- SHAP dependence plots
- Split-based variable importance
- Friedman's H-statistic
- Partial dependence plots (one and two-dimensional)
- SHAP force plot
As the results below show, the information obtained from SHAP values and SHAP dependence plots matches what the traditional variable importance measures and partial dependence plots show. The most important variables are 'K' and 'edu'. From the dependence plots, we conclude that there are non-linearities. For instance, the effect of K is almost flat for small and large values of K and increasing in between. Further, the effect of edu is steeper for small values of edu and flattens for larger values. Friedman's H-statistic indicates that there are some interactions. For the two variables with the largest amount of interaction, L and pop, we create a two-dimensional partial dependence plot below. Further, the SHAP force plot shows that the effect of the predictor variables differs between the years 2000 and 2021.
# SHAP values
library(SHAPforxgboost)
shap.plot.summary.wrap1(gpboost_model, X = data[,covars]) + ggtitle("SHAP values")

# SHAP dependence plots
shap_long <- shap.prep(gpboost_model, X_train = data[,covars])
shap.plot.dependence(data_long = shap_long, x = covars[2],
color_feature = covars[4], smooth = FALSE, size = 2) +
ggtitle("SHAP dependence plot for K")
shap.plot.dependence(data_long = shap_long, x = covars[4],
color_feature = covars[2], smooth = FALSE, size = 2) +
ggtitle("SHAP dependence plot for edu")


# SHAP force plot
shap_contrib <- shap.values(gpboost_model, X_train = data[,covars])
plot_data <- cbind(shap_contrib$shap_score, ID = 1:dim(data)[1])
shap.plot.force_plot(plot_data, zoom_in=FALSE, id = "ID") +
scale_fill_discrete(name="Variable") + xlab("Regions") +
geom_vline(xintercept=242.5, linewidth=1) + ggtitle("SHAP force plot") +
geom_text(x=100,y=-1.2,label="2000") + geom_text(x=400,y=-1.2,label="2021")

# Split-based feature importances
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
main = "Split-based variable importances")

# H-statistic for interactions
library(flashlight)
fl <- flashlight(model = gpboost_model, data = data.frame(y = data[,"y"], data[,covars]),
y = "y", label = "gpb",
predict_fun = function(m, X) predict(m, data.matrix(X[,covars]),
gp_coords_pred = matrix(-100, ncol = 2, nrow = dim(X)[1]),
group_data_pred = matrix(-1, ncol = 1, nrow = dim(X)[1]),
pred_latent = TRUE)$fixed_effect)
plot(imp <- light_interaction(fl, v = covars, pairwise = TRUE)) +
ggtitle("H interaction statistic") # takes a few seconds

# Partial dependence plots
gpb.plot.partial.dependence(gpboost_model, data[,covars], variable = 2, xlab = covars[2],
ylab = "gdp", main = "Partial dependence plot" )
gpb.plot.partial.dependence(gpboost_model, data[,covars], variable = 4, xlab = covars[4],
ylab = "gdp", main = "Partial dependence plot" )


# Two-dimensional partial dependence plot (to visualize interactions)
i = 1; j = 3 # variables i vs. j: L vs. pop
gpb.plot.part.dep.interact(gpboost_model, data[,covars], variables = c(i,j), xlab = covars[i],
ylab = covars[j], main = "Pairwise partial dependence plot")

Extensions
Separate random effects for different time periods
In the above model, we have used the same random effects for both years 2000 and 2021. Alternatively, one can also use independent spatial and grouped random effects for different time periods (independent under the prior; conditional on the data, there is dependence). In the GPBoost library, this can be done via the cluster_ids argument. cluster_ids needs to be a vector of length equal to the sample size, and every entry indicates the "cluster" to which an observation belongs. Different clusters then have independent spatial and grouped random effects, but the hyperparameters (e.g., marginal variance, variance components, etc.) and the fixed effects function are the same across clusters.
Below we show how to fit such a model and create two separate spatial maps. As the results show, the spatial effects are quite different for the years 2000 and 2021. In particular, there is less (residual) spatial variation (or heterogeneity) for 2021 compared to 2000. This is confirmed by the standard deviations of the predicted spatial random effects, which are almost twice as large for 2000 as for 2021 (see below). A further conclusion is that in 2000 there were more regions with "low" GDP numbers (spatial effects below 0), and this is no longer the case in 2021.
gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
likelihood = "gaussian", cov_function = "exponential",
cluster_ids = c(rep(1,242), rep(2,242)))
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they would be chosen again, e.g., using cross-validation
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
params = params, verbose = 0)
# Separate spatial maps for the years 2000 and 2021
pred <- predict(gpboost_model, group_data_pred = data[, c("cl")],
gp_coords_pred = data[, c("Long", "Lat")],
data = data[, covars], predict_var = TRUE, pred_latent = TRUE,
cluster_ids_pred = c(rep(1,242), rep(2,242)))
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[1:242]), by="FID")
plot_mu_2000 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="Spatial effect", option = "B") +
ggtitle("Spatial effect for 2000 (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[243:484]), by="FID")
plot_mu_2021 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="Spatial effect", option = "B") +
ggtitle("Spatial effect for 2021 (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
grid.arrange(plot_mu_2000, plot_mu_2021, ncol=2)
sd(pred$random_effect_mean[1:242]) # 0.2321267
sd(pred$random_effect_mean[243:484]) # 0.1286398

Interaction between space and fixed effects predictor variables
In the above model, there is no interaction between the random effects and the fixed effects predictor variables. I.e., there is no interaction between the spatial coordinates and the fixed effects predictor variables. Such interaction can be modeled by additionally including the random effects input variables (= the spatial coordinates or the categorical grouping variable) in the fixed effects function. The code below shows how this can be done. As the variable importance plot below shows, the coordinates are not used in the tree-ensemble, and there is thus no such interaction detectable for this data set.
gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
likelihood = "gaussian", cov_function = "exponential")
covars_interact <- c("Long", "Lat", covars) # add spatial coordinates to the fixed effects predictor variables
boost_data <- gpb.Dataset(data = data[, covars_interact], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they would be chosen again, e.g., using cross-validation
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
params = params, verbose = 0)
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
main = "Variable importances when including coordinates in the fixed effects")

Large data
For large data sets, calculations with Gaussian processes are slow, and one has to use an approximation. The GPBoost library implements several such approximations. For instance, setting gp_approx = "vecchia" in the GPModel constructor will use a Vecchia approximation. The data set used in this article is relatively small, and we can do all calculations exactly.
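For illustration, such a model could be specified as sketched below; the num_neighbors value is an assumption chosen for this sketch (see the GPBoost documentation for defaults):
# Sketch: Gaussian process with a Vecchia approximation for large data
gp_model_vecchia <- GPModel(gp_coords = data[, c("Long", "Lat")],
                            cov_function = "exponential", likelihood = "gaussian",
                            gp_approx = "vecchia", num_neighbors = 20)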
Other spatial random effects models
Above, we have used a Gaussian process to model spatial random effects. Since the data is areal data, another option is to use models that rely on neighborhood information, such as CAR and SAR models. These models are currently not implemented in the GPBoost library (they might be added in the future; contact the author).
Another option is to use a grouped random effects model defined by the categorical region ID variable for modeling spatial effects. The code below shows how this can be done in GPBoost. However, this model essentially ignores the spatial structure.
gp_model <- GPModel(group_data = data[, c("group", "cl")], likelihood = "gaussian")
(Generalized) linear mixed effects and Gaussian process models
(Generalized) linear mixed effects and Gaussian process models can also be run with the GPBoost library. The code below shows how this can be done with the fitGPModel function. Note that one needs to manually add a column of 1's to include an intercept.
X_incl_1 = cbind(Intercept = rep(1,dim(data)[1]), data[, covars])
gp_model <- fitGPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
likelihood = "gaussian", cov_function = "exponential",
y = data[, "y"], X = X_incl_1, params = list(std_dev=TRUE))
summary(gp_model)
## =====================================================
## Model summary:
## Log-lik AIC BIC
## 195.97 -373.94 -336.30
## Nb. observations: 484
## Nb. groups: 2 (Group_1)
## -----------------------------------------------------
## Covariance parameters (random effects):
## Param. Std. dev.
## Error_term 0.0215 0.0017
## Group_1 0.0003 0.0008
## GP_var 0.0126 0.0047
## GP_range 7.2823 3.6585
## -----------------------------------------------------
## Linear regression coefficients (fixed effects):
## Param. Std. dev. z value P(>|z|)
## Intercept 16.2816 0.4559 35.7128 0.0000
## L 0.4243 0.0565 7.5042 0.0000
## K 0.6493 0.0212 30.6755 0.0000
## pop 0.0134 0.0097 1.3812 0.1672
## edu 0.0100 0.0009 10.9645 0.0000
## =====================================================
References
- Sigrist, Fabio. "Gaussian Process Boosting". Journal of Machine Learning Research (2022).
- Sigrist, Fabio. "Latent Gaussian Model Boosting". IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
- https://github.com/fabsig/GPBoost
- The source of the dataset is Eurostat; see https://ec.europa.eu/eurostat/about-us/policies/copyright for more information on the rights ("Eurostat has a policy of encouraging free re-use of its data, both for non-commercial and commercial purposes.")