Precisely Compare Geographical Regions with GeoPandas

Author:Murphy  |  View: 23968  |  Time: 2025-03-22 20:10:51

QUICK SUCCESS DATA SCIENCE

A map of Texas color-filled with the area of Ukraine (by the author)

I previously wrote an article on how to compare the sizes of geographical regions, such as countries and states, by overlaying their maps. Here's an example using the state of Texas and the country of Ukraine:

Superimposed maps of Texas and Ukraine (by the author)

This approach is effective, except when regions have significantly different shapes. For instance, in the following figure, which is larger: Chile or Texas?

Superimposed maps of Texas and Chile (by the author)

This issue led Medium member Anurag to leave the following (paraphrased) comment on the article:

"Is it possible to fill the area of other countries with the area of Texas? Unlike just overlapping the two countries on top of each other. I think it would be easier to observe and understand."

This was an intriguing suggestion, and the answer is yes. You can do it. Here's an example where Chile, the larger region, is ‘filled' with the area of Texas:

A map of Chile color-filled with the area of Texas (by the author)

Now there's no doubt that Chile is larger than Texas. This is a handy method for when the shapes of the two regions are very different or the areas are very similar.

In this Quick Success Data Science project, I'll show you how to use Python to fill the map of one region with the area of another.


The Process

Here's the (very) high-level process we'll use:

  1. Get the shapefiles (polygons) of the two regions
  2. Calculate their areas
  3. Plot the boundary (map) of the largest region
  4. Fill the map with a colored polygon
  5. Whittle away at the polygon until it's the size of the smaller region

We'll use shapefiles and Python's Geopandas library to accomplish these tasks. For an overview of shapefiles and GeoPandas, including installation instructions, please visit my original article:

Comparing Country Sizes with GeoPandas


The Code

The following code takes as input a Shapefile of a US state and a country, determines which is the largest, and plots the area of the smaller region inside the map of the larger region, filling it with color from left to right. While this should be appropriate for most countries, we'll also look at a bottom-to-top filling approach for vertically oriented countries and states like Chile and California.

Importing Libraries

GeoPandas, like pandas, is built on Matplotlib, so we'll import Matplotlib for plotting. The shapely library will help us slice away the color-filled area until it matches the smaller region. Both Matplotlib and shapely are dependencies of GeoPandas, and are installed automatically with GeoPandas.

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import geopandas as gpd
from shapely.geometry import box

Assigning Constants

For convenience, we'll assign constants at the program's top, where they'll be easy to find and change. These include the state and country name, a coordinate reference system (CRS) for projecting from a 3D globe to a 2D map, and an integer that we'll use to set the width of the slices we'll shave from the color-filled polygon. The larger this number, the finer slice, and the more accurately we can match the area of the smaller region.

# Constants
STATE = 'Texas'
COUNTRY = 'Chile'
CRS_STATE = 'EPSG:5070'  # Coordinate Reference System
CRS_COUNTRY = 'EPSG:20048'  # Coordinate Reference System
DIVISOR = 1_000  # Larger values increase accuracy and run time

See the original article referenced previously for more on coordinate reference systems and EPSG codes.

HINT: To find the EPSG code for a country, search online using queries such as, "What is the recommended EPGS code for Australia?".

Selecting and Projecting Shapefiles

GeoPandas used to ship with the Natural Earth public domain dataset, which includes shapefiles for country and state boundaries. This was deprecated in early 2024, so we've got to download the file manually.

Fortunately, this is easy. Just navigate to this site and click the download link highlighted in yellow below to get the state boundaries :

Click the highlighted link to download the "states and provinces" shapefile (by the author)

To get the country boundaries, click the highlighted link below:

Click the highlighted link to download the "countries" shapefile (by the author)

Move the zipped files into the folder containing your Python script or notebook, then run the following code:

# Download the zipped files used below from:
# https://www.naturalearthdata.com/downloads/110m-cultural-vectors/
world = gpd.read_file('ne_110m_admin_0_countries.zip')
usa = gpd.read_file('ne_110m_admin_1_states_provinces.zip')

# Rename columns for consistency
world = world.rename(columns={'ADMIN': 'name'})

# Create GDFs for each region
state = usa[usa.name == STATE]
country = world[world['SOVEREIGNT'] == COUNTRY]

# Project to CRS
state = state.to_crs(CRS_STATE)
country = country.to_crs(CRS_COUNTRY)

The country shapefile is assigned to the world variable and loaded using GeoPandas' read_file() method. It's too big to show here, but if you want to see the contents, enter world.head(), just as you would for a pandas DataFrame.

The world and usa GeoDataFrames use different column names for the name of the country and state, respectively. This will prove inconvenient later, so we'll set the world column name ("ADMIN") to "name".

Next, we create the state and country GeoDataFrames based on the STATE and COUNTRY constants.

The final two lines convert the shapefiles to the proper coordinate reference system, as specified by the EPSG codes.

Determining the Largest and Smallest Regions

Now we calculate the area of each region and determine which is the biggest and smallest.

# Calculate areas
area1 = state.geometry.area.sum()
area2 = country.geometry.area.sum()

# Determine biggest and smallest regions
biggest, smallest = ((state, country) if area1 >= area2 
                     else (country, state))

What's happening here? Well, the state.geometry.area.sum() line accesses the geometry column of the state GeoDataFrame, which contains the geometric shapes (polygons) representing the state.

Toy example of a geometry column in a GeoDataFrame (by the author)

The area property calculates the area of each polygon in the geometry column. The result is a GeoSeries of area values. The sum() method then sums up the areas in the GeoSeries to get the total area of the geometries in the state GeoDataFrame.

The total area of the geometries will be in the units of the CRS. In this case, we're using square meters.

Filling the Map with the Smaller Area

Next, we fill the larger area map with a form-fitting colored polygon. We then iteratively slice away the colored polygon until its area matches the area of the smaller region, represented by the target_area variable.

# Create filled area
filled_area = biggest.unary_union

# Define target area
target_area = smallest.geometry.area.sum()

# Remove slices until the remaining area matches target
remaining_area = filled_area
while remaining_area.area > target_area:
    minx, miny, maxx, maxy = remaining_area.bounds
    slice_width = (maxx - minx) / DIVISOR
    slice_to_remove = box(maxx - slice_width, miny, maxx, maxy)
    remaining_area = remaining_area.difference(slice_to_remove)

# Convert remaining area to GeoDataFrame
remaining_area_gdf = gpd.GeoDataFrame({'geometry': [remaining_area]}, 
                                      crs=biggest.crs)

First, we create a polygon the same size and shape as the map of the biggest region. The unary_union property merges all the geometries in the biggest GeoDataFrame into one single geometry, which is then assigned to the variable filled_area. The target_area represents the area of the smallest region.

The remaining_area variable is a copy of the filled_area that we can modify until its area matches the target_area value.

The while loop is the secret sauce that makes all this possible. It runs as long as the remaining area exceeds the target area. The loop body consists of the following steps:

  • remaining_area.bounds retrieves the bounding box coordinates (minx, miny, maxx, maxy) of the current remaining_area.
  • slice_width is calculated as a fraction of the width of the bounding box, determined by the DIVISOR constant.
  • slice_to_remove creates a vertical rectangular slice (using shapely.geometry.box) at the right edge of the bounding box using the width stored in slice_width.
  • remaining_area.difference(slice_to_remove) subtracts the slice_to_remove from remaining_area, effectively removing a small slice from the right edge.

Here's a view of the biggest, remaining area, and bounding box polygons at the start of the loop:

The bounding box is shown in blue. Iterative slices will start on the right and move left. (by the author)

Once the loop conditions are met, the remaining_area is converted into a GeoDataFrame (remaining_area_gdf) with the same CRS as biggest. Now we're ready to plot.

Plotting the Results

We'll use Matplotlib to overlay the map of the biggest region with the remaining area polygon.

# Plot
fig, ax = plt.subplots(figsize=(10, 10))
biggest.boundary.plot(ax=ax, color='black')
remaining_area_gdf.plot(ax=ax, color='red', alpha=0.5)

# Add legend
smallest_name = smallest['name'].iloc[0]
red_patch = mpatches.Patch(color='red', 
                           alpha=0.5, 
                           label=smallest_name)
plt.legend(handles=[red_patch], loc='best')

# Customize plot
plt.xticks([])
plt.yticks([])
plt.title(f"{biggest['name'].iloc[0]} vs. {smallest_name} Area Comparison")

plt.show()

The biggest.boundary code accesses the boundary property of the biggest GeoDataFrame. This property returns a GeoSeries containing the outlines of each geometry in the GeoDataFrame, which forms the outline of Texas.

The Patch class helps us manually build a legend with a rectangular patch of color attached to a label.

Here's the output, as shown previously:

The filled map of Chile (by the author)

The area of the pink-colored region is equivalent to the area of Texas.

Given Chile's aspect ratio, filling it from bottom to top might look better. To switch the fill direction, change the slice_height and slice_to_remove lines in the while loop as below:

    slice_height = (maxy - miny) / DIVISOR
    slice_to_remove = box(minx, maxy - slice_height, maxx, maxy)

Now, the color-fill extends from bottom to top, like water filling a pipette:

Chile filled from bottom to top (by the author)

Fine-tuning the Results

Remember our DIVISOR constant? This value sets the width of the slices removed from the remaining_area polygon with each iteration of the while loop. If this value is too small, the slices will be thick, and we won't be able to reproduce the target area. If it's too large, the program might run for an unnecessarily long time.

To check the performance of the constant, we'll print out the final areas of the target value and the remaining area polygon:

# Check smallest vs. remaining area
print(f"Smallest area: {target_area:,} sq. meters")
print(f"Remaining area: {remaining_area.area:,} sq. meters")
Smallest area: 691,287,205,135.4048 sq. meters
Remaining area: 691,007,387,015.6472 sq. meters

That's pretty close. If you want more precision, change the DIVISOR constant to a larger number, such as 1500 or 2000.


Summary

To accurately compare the sizes of two geographical regions, we used GeoPandas and shapefiles to fill the map outline of one region with a colored polygon representing the smaller region. The example code was designed to compare U.S. states to other countries but can be easily adapted for state-to-state or country-to-country comparisons.


If you found this article useful and would like to support me, please buy me a coffee. I promise I'll drink it.

Photo by Mayumi Maciel on Unsplash

Thanks!

Thanks for reading and clapping and please follow me for more Quick Success Data Science projects in the future.

Tags: compare-map-areas Geopandas geospatial-data-science Shapefile Tips And Tricks

Comment