Add One Line of SQL to Optimise Your BigQuery Tables

Author:Murphy  |  View: 26625  |  Time: 2025-03-22 23:44:52

In my previous article, I explained how to optimise SQL queries using partitioning:

Use the Partitions, Luke! A Simple and Proven Way to Optimise Your SQL Queries

Now, I'm writing the sequel! (Dad joke, anyone?)

This article will look at clustering: another powerful optimisation technique you can use in Bigquery. Like partitioning, clustering can help you write more performant queries that are quicker and cheaper to run. If you want to develop your SQL toolkit and build those higher-level Data Science skills, this is a great place to start.

What's a clustered table?

In BigQuery, a clustered table is a table that keeps similar rows grouped together in physical "blocks".

For example, picture a table called user_signups that keeps track of all the people registering an account on a fictitious website. It's got four columns:

  • registration_date: the date on which the user created an account
  • country: the country where the user is based
  • tier: the user's plan ("Free" or "Paid")
  • username: the user's username

If we wanted, we could cluster the table by country so that users from the same country are stored nearby each other in the table:

Image by author

As you can see, each "block" within the table contains users from a particular country. The clustered table still contains the same data; it's just ordered in a more efficient way.

Clustering speeds up our queries because it means BigQuery has to process less data

When you query a clustered table, BigQuery will first identify the relevant blocks needed to execute the query. This preliminary step – known as block pruning – makes your queries quicker and cheaper to run because BigQuery doesn't perform unnecessary operations on the irrelevant blocks. It only uses the blocks/data it actually needs.

To see the benefits of this, let's imagine that our user_signups table contains 1,000,000 rows, and let's say we wanted to fetch the users that are from Lebanon and signed up on 2023-12-01. We'd write:

Sql">SELECT *
FROM user_signups
WHERE
  country = 'Lebanon'
  AND registration_date = '2023–12–01'

When we run that code, BigQuery will (in theory) start by finding the relevant block (in this case, that's the block containing users from Lebanon), and then filter that block to find the rows where registration_date equals "2023-12-01". It won't need to read ALL the rows in the table; just the ones in the relevant cluster.

IMPORTANT: Clustering won't always do what you expect (especially for small tables)

Image by Madison Oren on Unsplash

BigQuery is smart.

It knows that creating (and pruning) clusters/blocks uses computing power, and that sometimes the effort required to manage these clusters is too high to bring any performance gains.

For this reason, BigQuery won't necessarily create a new block for each distinct value in the clustering column (1). As Data Engineer Alessandro writes,

"Clusters are not like partitions. In fact there is no guarantee that there will be one cluster per column value… This is also why BigQuery cannot give you a good estimation of how much data the query will use before running it (like it does for partitions)."

Think about it this way: if your entire table only consists of 10 rows, it's probably quicker to scan the entire table than to go through the process of block pruning and managing clusters. BigQuery knows this, so it doesn't waste resources on clustering.

How big does your table need to be in order to get the benefits of clustering? Well, according to one ex-Google engineer,

"If you have less than 100MB of data per [group you want to cluster by], clustering won't do much for you"

At risk of stating the obvious, it's also worth noting that clustering won't help your queries if you query on a non-clustering column (e.g., if I didn't include the WHERE country = ... filter in my query above, we wouldn't be able to apply block pruning on the country column).

Creating clustered tables

It's very easy to create a clustered table. Simply add a CLUSTER BY clause at the end of your CREATE TABLE statement:

CREATE TABLE `myproject.mydataset.clustered_table` (
  registration_date DATE,
  country STRING,
  tier STRING,
  username STRING
) CLUSTER BY country; # Add this

You can cluster by up to four columns, and (unlike with partitions) you're not limited to INT64 or DATE columns; you can also cluster by columns with data types like STRING and GEOGRAPHY.

Here's what it would look like to cluster our user_signups table by two columns:

Image by author

Combine clustering with partitioning for optimal performance

In my previous article, I wrote about partitioning: a way to split your table into physical partitions based on dates or integers.

Luckily, it's very easy to combine clustering with partitioning.

Here's an example showing our user_signups table, now partitioned by registration_date and clustered by country:

Image by author

When you query the table, BigQuery will first try to apply partition pruning (to identify the relevant partitions) and then apply block pruning within the relevant partitions in order to find the relevant rows.

This is one of the reasons I love working with BigQuery so much: it gives you the best of both worlds!

Want a place to practice SQL?

If you enjoyed this article, you might like my site the-sql-gym.com, which contains over 100 practice SQL questions. If you want to boost your SQL skills, check it out!

Thanks for reading, and feel free to connect with me on Twitter or LinkedIn!

Tags: Bigquery Data Analysis Data Engineering Data Science Sql

Comment