A Data Scientist Friendly Variogram Tutorial for Quantifying Spatial Continuity

Introduction
Variograms are used to demonstrate the distance-based variability of spatial data. Understanding and modeling spatial continuity with variograms is important as they are used to estimate point measurements into practical blocks across a wide range of applications such as mining ore grades, oil concentrations, or the environmental contaminants.
Despite open-source options being available to generate variograms, due to their complexity, most users rely on expensive software packages which abstract a lot of the details. This tutorial aims to give a brief introduction to variograms and how the open source Geostatistics Library (GSLib) which can be used independently or with Python to develop variograms.
Here a variogram model is developed on a synthetic mining dataset but the workflow could be used for any kind of spatial data for meteorological applications like temperature, or environmental applications like contaminant tracking.
Tutorial Requirements
We'll need GSLib which is available here for free download and some of the most basic, commonly used Python libraries which are also in the complete code uploaded to github:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Variogram Basics
The general idea of variograms are that data points further away from each other are more likely to be more distinct than data points close to one another. The variance of data points further and further apart eventually reaches a point where it is equal to the global variance of the data.
We start with a spatial dataset and can generalize the variogram modeling workflow into a few steps as shown below. First we need to determine adequate search parameters for the variogram. Then identify the major and minor continuity axis. Finally the variograms can be then modeled and subsequently used for estimation or simulation purposes. Each of the steps will be further explained in the following sections and applied on a dataset in the tutorial.

Variogram Modeling Parameters: Nugget, Range, and Sill
Variograms models are made up of three components which describe different attributes about the data:
- Nugget: Y intercept of a variogram model indicating the immediate variability at near zero distance or randomness at a single point
- Range: Distance after which the variogram plateaus indicating points at that distance are no longer spatially correlated
- Sill: Variance of the modeled data, once the variance of points a certain distance apart meet it, there is no longer spatial correlation
The variogram schematic below illustrates each of the model parameters. Before reaching the range the data is considered to be spatially correlated. Variograms with longer ranges are more spatially continuous than those with shorter ranges. High nugget variograms indicate a high variability at near zero distances and are typical of more variable, discontinuous datasets.

Variogram Development
To generate experimental variogram data we need to define pairs. These pairs are composed of a head and tail a specified distance apart. We increase the distance to test the average difference in the pairs and understand how it varies as a function of distance. If we are working with 2D data, a few different search parameters are standards in variogram modeling and illustrated below. The search parameters come with tolerances as it is highly unlikely we'll find a match at exactly 2.0000 m a certain direction away.

After defining the parameters, the search is iteratively applied on all points to develop the experimental variogram data at each lag as shown below for one point. The average variogram is calculated at each lag and eventually reaches the sill or variance of the global data.

The calculation done at each lag is simply half the difference squared of all pairs as illustrated below. This variogram value is plotted against the lag to develop the variogram.

Developing Experimental Variogram Data with GSLib's gamv
Now to actually calculate the experimental variogram data with GSLib's gamv we will first need to define the search parameters. Below is the search parameter schematic next to an annotated screenshot indicating the corresponding GSLib gamv parameter file input.

After setting the parameters, make sure your spatial data is in the right .dat format for GSLib like shown below for our spatial copper grade data example.

Now to run the GSLib code you can use the command window, navigate to the working directory, and exectute gamv as shown below:

The output variogram will be saved to gamv.out with the data sorted as shown below:

After that you can analyze the output any way you'd like, we will use Python in this Tutorial.
Application on a Synthetic Mining Dataset
The synthetic blasthole dataset we are using is modeled after a real Cu mine. The blasthole data is scattered ~7×7 m in 2D with easting and north measured in meters and each point has a measured copper grade associated with it measured in weight percent (wt%). All the code and data used in this tutorial is available on github. A map showing the copper grades and how they're scattered in space can be seen below.

The following GSLib Variogram search parameters below are adequate for ~7×7 m spatial data and we'll use them to generate our variograms. Here is a great source for determining adequate search parameters depending on the spatial data.
- Lag separation distance: 7 m, this is roughly the data spacing
- Lag tolerance: 3.5 m, rule of thumb is to make the lag tolerance half the spacing
- Number of lags: 30, Most the area is covered in 30*7m (210 m)
- Azimuth tolerance: 22.5 degrees, could be slightly smaller say 15 degrees butwe want to make sure we can get a good number of pairs
- Bandwidth: 10 m, could be slightly smaller say 7.5 m but we want to make sure we can get a good number of pairs
- Variogram type: Semivariogram
Identifying the Principal Directions of Minimum and Maximum Discontinuity
Since we are working in 2D, we'll need to identify the two orthogonal directions of high and low continuity often referred to as the major and minor axis respectively. There are various ways to determine the principal directions, the most basic way presented here is to generate multiple variogram pairs at equal intervals and find the two which encompass the highest and lowest spatial continuity. Other useful methods include generating variogram maps which visualize the spatial continuity, GSLib offers packages for variogram maps and there is a guide to use them here.
Below we see variogram pairs developed at 15 degree intervals, it is clear to see that the variograms with azimuth of 000° and 090° have the highest and lowest continuity respectively.

We can quickly check with the grade map of the 000° and 090° azimuth directions make reasonable sense for the and least continuous directions respectively. As we can see with the map reproduced below it makes sense that the lowest continuity is west-east as we have the highest grades transitioning to the lowest in that direction while the grades don't vary as much north-south.

Variogram Modeling
The experimental variogram data for the 000° and 090° azimuth was modeled using an exponential model. Different nuggets, ranges, and partial sills were tested until a qualitatively good fit was achieved. Better fits to the experimental data might have been achieved using nested structures with multiple models or exploring different models like the most common spherical model using in mining.

Now the variogram model can be used for any kind of estimation or simulation of the point data into blocks. GSLib also offers the packages required to run commonly used ordinary krigging estimations or sequential gaussian simulations.
Conclusions
Understanding and modeling spatial continuity is challenging, especially for those without a geostatistics background. Here we covered a brief introduction and quick tutorial on variograms and generating variogram models with open-source software. Many users currently use expensive software as there is an overall lack in coding experience for spatial continuity analysis and open-source software such as GSLib are not widely known.
Also, many of the commercial software abstract important details of the variogram which could impact the developed models. Various geostatistical texts emphasize the importance of not treating variogram modeling as a black box and the effort spent trying to understand what is going on in the data can be very useful to generate robust variogram models.