Visualizations of New York City

New York City's open data platform is an incredible source of information. City law mandates that all public data collected and generated by the city be made available through the portal, free for anyone to use.
Datasets range from transport, housing, and motor vehicle incidents, to a Central Park squirrel census, and even park ranger reports of aggressive turtle encounters.
Geography, infrastructure, and sociology datasets like these represent real-world processes and events. Even if you have no connection to or little interest in NYC or urban areas in general, they give you a chance to work with data that looks a lot more like what you'll encounter in a professional role than the likes of MNIST or Titanic survivors. Better still, they're almost as easy to access.
We're going to run through a demonstration of just how easy these datasets are to use and build some interesting visuals in the process.
To keep the code blocks as succinct as possible, here are the required modules for all the code in this post:
Python">import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import requests
from scipy.stats import gaussian_kde
import seaborn as sns
from shapely.geometry import Point, shape, box, Polygon
Make sure they're installed if you want to replicate anything yourself.
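If any of them are missing from your environment, a one-liner like this (standard PyPI package names; adapt to your own environment manager) should cover it:
pip install folium geopandas matplotlib numpy pandas plotly requests scipy seaborn shapely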
Building Footprints
This is one of my favorite datasets to play around with. The data includes footprint polygons, ages, and heights for most of the buildings in NYC.
We'll start with the data pull, kept separate from the visualization code since we're using this dataset for a couple of different visuals.
# Pull data
api_endpoint = 'https://data.cityofnewyork.us/resource/qb5r-6dgf.json'
limit = 1000  # Number of rows per request
offset = 0  # Starting offset
data_frames = []  # List to hold chunks of data
# Loop to fetch data iteratively
# Uncomment the next line (and comment out "while True") to fetch a sample much faster
# while offset <= 100000:
while True:  # while True will take a long time but gets all the data
    url = f"{api_endpoint}?$limit={limit}&$offset={offset}"
    chunk = pd.read_json(url)
    if chunk.empty:
        break  # Stop the loop when no more data is returned
    data_frames.append(chunk)
    offset += limit
# Concatenate all chunks into a single DataFrame
data = pd.concat(data_frames, ignore_index=True)
# Convert the 'the_geom' column from a dictionary to a Shapely geometry object
data['geometry'] = data['the_geom'].apply(lambda x: shape(x))
# Convert the Pandas DataFrame to a GeoDataFrame
gdf = gpd.GeoDataFrame(data, geometry='geometry', crs="EPSG:4326")
# Reduce each footprint to a representative point (centroid) for visualization,
# projecting to a metric CRS first so the centroid is calculated accurately
gdf['centroid'] = gdf['geometry'].to_crs(epsg=3395).centroid.to_crs(epsg=4326)
# Get rid of columns we don't need anymore
keep_cols = ['cnstrct_yr', 'heightroof', 'geometry', 'centroid']
gdf = gdf[keep_cols]
That gives us a working dataset that looks like this.

The geometry column holds the polygon for the building footprint, and the centroid column holds a single point at the center of that footprint, i.e. one lat/long for the building's location.
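If you're following along and want to confirm the pull worked, a quick inspection like this (just standard pandas and GeoPandas calls on the gdf built above) shows the same thing:
# Quick sanity check of the working dataset
print(gdf.shape)  # number of buildings pulled
print(gdf[['cnstrct_yr', 'heightroof']].describe())  # year and height ranges
print(gdf.head())  # first few footprints and centroids
Now we can get into the fun stuff.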
Visualizing New vs Old
At this point, the city is a mashup of old brownstones, densely packed single-family homes and rowhouses, blocky apartment complexes, and glass towers overlooking the rivers.
Say we want to figure out where to live, and we want a neighborhood with a lot of older buildings for some historical charm. In many American cities, and particularly in the Northeast, buildings from the 1930s and earlier are colloquially called "prewar buildings", or "prewars".
To find a neighborhood that might suit what we're looking for, we can plot each building's centroid on a map, colored by decade of construction, with a density layer underneath highlighting concentrations of prewar buildings.
# Create the bounding box from the provided corner points
bounding_box = box(-74.0294, 40.685, -73.91695, 40.742)
# Filter the GeoDataFrame using the bounding polygon
gdf = gdf[gdf['centroid'].within(bounding_box)]
# Create a new column for building decade
gdf['decade'] = (gdf['cnstrct_yr'] // 10) * 10
# Remove rows where 'cnstrct_yr' is NaN
gdf = gdf[gdf['cnstrct_yr'].notna()]
# Get unique decades
unique_decades = sorted(gdf['decade'].unique())
# Use the Cividis colorscale and split it for our unique decades
colorscale = px.colors.sequential.Cividis
num_decades = len(unique_decades)
colors = [colorscale[i * len(colorscale) // num_decades] for i in range(num_decades)]
color_map = dict(zip(unique_decades, colors))
# Filter the data for buildings built in the 1930s and earlier
old_buildings = gdf[gdf['decade'] <= 1930]
# Create a new figure for better control
fig = go.Figure()
# Add the traces for each decade
for decade, color in color_map.items():
    subset = gdf[gdf['decade'] == decade]
    # Add the original trace with showlegend set to False
    fig.add_trace(go.Scattermapbox(
        lat=subset['centroid'].y,
        lon=subset['centroid'].x,
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=3,
            color=color,
            opacity=0.8
        ),
        text=decade,
        name=str(int(decade)),
        hoverinfo='none',
        showlegend=False
    ))
    # Add a dummy trace with larger markers for the legend,
    # placed outside the visible map
    fig.add_trace(go.Scattermapbox(
        lat=[90],  # Latitude outside the visible map
        lon=[0],  # Longitude outside the visible map
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=10,
            color=color,
            opacity=1
        ),
        legendgroup=str(int(decade)),
        showlegend=True,
        name=str(int(decade)),
        hoverinfo='none'
    ))
# Add heatmap for older buildings
fig.add_trace(go.Densitymapbox(
    lat=old_buildings['centroid'].y,
    lon=old_buildings['centroid'].x,
    radius=4,
    colorscale="Greens",
    opacity=1,
    name="Density of Prewar Buildings",
    showlegend=True,
    zmax=3,
    zmin=0,
    showscale=False
))
fig.update_layout(
    title='Buildings by Decade with Density Underlay for Prewar Buildings',
    autosize=True,
    mapbox=dict(
        accesstoken=None,
        bearing=0,
        center=dict(lat=40.71359, lon=-73.97216),
        pitch=0,
        zoom=12.6,
        style='carto-positron'
    ),
    height=800,
    width=1200,
    legend=dict(tracegroupgap=0)
)
# Display the map
fig.show()

If you're familiar with NYC, some of this is unsurprising and some might be a little more unexpected. The quintessential brownstone rows of Cobble Hill and the West Village are obvious. But I wasn't expecting such a high proportion of the buildings in Greenpoint and Williamsburg (the northern tip of Brooklyn) to be that old.
Building Sizes
Because the dataset includes both building footprints and roof heights, we can calculate a volume for each building: footprint area multiplied by roof height. We'll then use the centroid of each footprint to assign it to a borough, which lets us compare average building size across the boroughs.
# ...following on from the data pull code block
# Define borough bounding boxes
# These are very loose bounds and a more thorough analysis should use a higher
# precision polygon.
boroughs = {
    "Manhattan": box(-74.02, 40.70, -73.91, 40.88),
    "Bronx": box(-73.93, 40.80, -73.79, 40.92),
    "Brooklyn": box(-74.05, 40.57, -73.85, 40.74),
    "Queens": box(-73.94, 40.54, -73.70, 40.80),
    "Staten Island": box(-74.26, 40.50, -74.03, 40.65)
}
# Assign a borough to each building based on its centroid
def assign_borough(centroid):
    for borough, bbox in boroughs.items():
        if bbox.contains(centroid):
            return borough
    return None
# Assuming the gdf variable already contains your data
gdf['borough'] = gdf['centroid'].apply(assign_borough)
# Calculate building volume using footprint area and height
# (project to a metric CRS first so the footprint area is in square meters rather than degrees)
gdf['volume'] = gdf['geometry'].to_crs(epsg=3395).area * gdf['heightroof']
# Compute the median volume by borough (median rather than mean, given how skewed building sizes are)
avg_volume_by_borough = gdf.groupby('borough')['volume'].median()
# Create 3D bar shapes using surface plots
def create_3d_bar(x, y, z, dx, dy, dz):
    # Define vertices of the bar
    x_data = [[x, x, x+dx, x+dx, x], [x, x, x+dx, x+dx, x]]
    y_data = [[y, y+dy, y+dy, y, y], [y, y+dy, y+dy, y, y]]
    z_data = [[z, z, z, z, z], [z+dz, z+dz, z+dz, z+dz, z+dz]]
    return go.Surface(
        x=x_data,
        y=y_data,
        z=z_data,
        colorscale=[[0, 'blue'], [1, 'blue']],
        showscale=False,
        opacity=0.5
    )
# Define bar dimensions
dx = 0.4
dy = 0.4
# Create figure
fig = go.Figure()
# Add bars to the figure
for i, borough in enumerate(avg_volume_by_borough.index):
    fig.add_trace(create_3d_bar(i, 0, 0, dx, dy, avg_volume_by_borough[borough]))
# Define the layout with adjusted aspect ratio for a wider chart area
fig.update_layout(
    title='Average Building Volume by Borough',
    scene=dict(
        xaxis=dict(
            title='Borough',
            tickvals=list(range(len(avg_volume_by_borough))),
            ticktext=avg_volume_by_borough.index
        ),
        yaxis=dict(title='', visible=False),
        zaxis=dict(title='Average Building Volume'),
        aspectratio=dict(x=3, y=2, z=1.5)  # Adjust the aspect ratio for a wider x-axis
    ),
    margin=dict(t=40, b=40, l=60, r=60)
)
fig.show()

No shocks here that Manhattan buildings are bigger. However, you might have expected bigger differences between the other four boroughs, perhaps especially between Brooklyn with its towers overlooking the East River and Staten Island with its distinctly suburban look.
All four of the outer boroughs have huge swathes of low-density housing. Queens in particular is huge, being 56% larger than Brooklyn, the second-largest borough.
We can see this in the building size distributions with a boxenplot, sketched below. Note that the volume axis is on a log scale to better display the outer boroughs alongside Manhattan's dramatically higher number of massive buildings.
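The exact code behind this chart isn't included above, but a minimal sketch along these lines, assuming the gdf with the 'borough' and 'volume' columns from the previous block, produces the same kind of view with seaborn:
# Boxenplot of building volume distributions by borough (log scale)
# Assumes gdf carries the 'borough' and 'volume' columns computed earlier
plot_data = gdf[gdf['borough'].notna() & (gdf['volume'] > 0)]
fig, ax = plt.subplots(figsize=(12, 6))
sns.boxenplot(data=plot_data, x='borough', y='volume', ax=ax)
ax.set_yscale('log')  # log scale keeps the outer boroughs readable next to Manhattan
ax.set_xlabel('Borough')
ax.set_ylabel('Building volume (log scale)')
ax.set_title('Distribution of Building Volumes by Borough')
plt.show()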

The sheer number of small, single-family homes in Queens stands out, even compared with Staten Island.
WiFi Hotspots
NYC operates wifi hotspots across much of the city, and their locations are available through the open data portal. These are only city-operated hotspots, so the likes of Starbucks are not included.
We can read the JSON file directly from the portal site and create a map of the locations.
# Define the URL
url = "https://data.cityofnewyork.us/resource/yjub-udmw.json"
# Send a GET request
response = requests.get(url)
# Load the response into JSON
data = response.json()
# Convert the JSON data into a DataFrame
df = pd.DataFrame(data)
# Convert lat and lon columns to float
df['latitude'] = df['latitude'].astype(float)
df['longitude'] = df['longitude'].astype(float)
# Map the borough codes to borough names
borough_dict = {'1': 'Manhattan', '2': 'Bronx', '3': 'Brooklyn', '4': 'Queens', '5': 'Staten Island'}
df['borough'] = df['borough'].map(borough_dict)
# Replace 'token' with your own Mapbox access token
px.set_mapbox_access_token('token')
fig = px.density_mapbox(
    df,
    lat='latitude',
    lon='longitude',
    zoom=10,
    mapbox_style="carto-positron",
    title="Distribution of WiFi Hotspots in NYC",
    radius=6
)
fig.update_layout(
    height=800,
    width=1200
)
fig.show()

Most NYC subway stations have free wifi, and they show up really nicely on this map. The high-density areas of hotspots extending up both sides of Central Park are mostly subway stations.
You can also clearly see two of the main subway lines into Queens and Brooklyn by just looking for the locations arranged in unnaturally straight lines.
Squirrel Census
The Squirrel Census (https://www.thesquirrelcensus.com/) is a science, design, and storytelling project focusing on the Eastern gray squirrel (Sciurus carolinensis). They count squirrels and present their findings to the public.
The dataset contains a record for each of the 3,023 squirrel sightings, including location coordinates, age, primary and secondary fur color, elevation, activities, communications, and interactions between squirrels and humans.
Now let's ask a fun question.
Where in Central Park are the loudest squirrels?
There are three fields in the data denoting different noises/calls from a squirrel: 'Kuks', 'Quaas', and 'Moans'. We'll count any kind of noise as noisy, and create a density map to show where the loud squirrels are hanging out.
# Pull data
data_url = 'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD'
squirrels = pd.read_csv(
    data_url,
    usecols=['X', 'Y', 'Kuks', 'Quaas', 'Moans']
)
# Create a column denoting that the squirrel made any kind of noise
squirrels['noisy'] = squirrels[['Kuks', 'Quaas', 'Moans']].any(axis=1)
# Filter out the quiet squirrels (copy to avoid chained-assignment warnings)
noisy_squirrels = squirrels[squirrels['noisy']].copy()
# Convert the noisy column to integer
noisy_squirrels['noisy'] = noisy_squirrels['noisy'].astype(int)
# Create the density heatmap
fig = px.density_mapbox(
    noisy_squirrels, lat='Y', lon='X', z='noisy', radius=50,
    center=dict(lat=40.783, lon=-73.969),  # Center coordinates for Central Park
    zoom=13,
    mapbox_style="stamen-terrain",
    # mapbox_style="stamen-watercolor",
    color_continuous_scale=["white", "orange", "red"],
    range_color=[0, 5],  # Adjusting the range for the color scale
)
fig.update_layout(
    mapbox_bearing=0,
    height=700,
    width=1000,
    title='Density of Noisy Squirrel Observations',
    showlegend=False,
    coloraxis_showscale=False
)
fig.show()

Looks like the squirrels of the Upper West Side have the most to say.
Final Notes
There's a wealth of further data to explore on NYC Open Data, so make sure to check it out for yourself. Most large cities also have their own open data programs. Notable examples of extensive data programs include Los Angeles, Toronto, London, and Singapore.
NYC Open Data is public domain and so per the Open Data FAQs "there are no restrictions on the use of Open Data", including commercial use. See the terms of use for further detail.
The code for everything in this post is available on GitHub, and I've attached an MIT license to the repo so you can take and use anything you want.