How to Compare Two Tables For Equality in BigQuery

Author:Murphy  |  View: 26333  |  Time: 2025-03-23 20:01:50

Comparing tables in BigQuery is a crucial task when testing the results of data pipelines and queries prior to productionizing them. The ability to compare tables allows for the detection of any changes or discrepancies in the data, ensuring that the data remains accurate and consistent.

In this article we will demonstrate how to compare two (or more) tables on BigQuery and extract the records that differ (if any). More specifically, we will showcase how to compare tables with identical columns as well as tables with a different amount of columns.


First, let's start by creating two tables with some dummy values that we will then be referencing throughout this tutorial in order to demonstrate a few different concepts.

Sql">-- Create the first table
CREATE TABLE `temp.tableA` (
  `first_name` STRING,
  `last_name` STRING,
  `is_active` BOOL,
  `no_of_purchases` INT
)
INSERT `temp.tableA` (first_name, last_name, is_active, no_of_purchases)
VALUES 
  ('Bob', 'Anderson', True, 12),
  ('Maria', 'Brown', False, 0),
  ('Andrew', 'White', True, 4)

-- Create the second table
CREATE TABLE `temp.tableB` (
  `first_name` STRING,
  `last_name` STRING,
  `is_active` BOOL,
  `no_of_purchases` INT
)
INSERT `temp.tableB` (first_name, last_name, is_active, no_of_purchases)
VALUES 
  ('Bob', 'Anderson', True, 12),
  ('Maria', 'Brown', False, 0),
  ('Andrew', 'White', True, 6),
  ('John', 'Down', False, 0)

Comparing records of tables with the same columns

Now that we have created our two example tables, you should have noticed that there are a couple of differences between them.

SELECT * FROM `temp.tableA`;

+------------+-----------+-----------+-----------------+
| first_name | last_name | is_active | no_of_purchases |
+------------+-----------+-----------+-----------------+
| Bob        | Anderson  | true      | 12              |
| Andrew     | White     | true      | 4               |
| Maria      | Brown     | false     | 0               |
+------------+-----------+-----------+-----------------+
SELECT * FROM `temp.tableB`;

+------------+-----------+-----------+-----------------+
| first_name | last_name | is_active | no_of_purchases |
+------------+-----------+-----------+-----------------+
| Bob        | Anderson  | true      | 12              |
| Andrew     | White     | true      | 6               |
| Maria      | Brown     | false     | 0               |
| John       | Down      | false     | 0               |
+------------+-----------+-----------+-----------------+

Now assuming that table temp.tableB is the latest version of some dataset whereas temp.tableA is an older one and we would like to see the actual differences (in terms of records) between the two tables, all we need is the following query:

WITH
  table_a AS (SELECT * FROM `temp.tableA`),
  table_b AS (SELECT * FROM `temp.tableB`),
  rows_mismatched AS (
    SELECT
      'tableA' AS table_name,
      *
    FROM (
      SELECT
        *
      FROM
        table_a EXCEPT DISTINCT
      SELECT
        *
      FROM
        table_b 
    )

    UNION ALL

    SELECT
      'tableB' AS table_name,
      *
    FROM (
      SELECT
        *
      FROM
        table_b EXCEPT DISTINCT
      SELECT
        *
      FROM
        table_a 
    )
  )

SELECT * FROM rows_mismatched

Now the result will contain all the differences observed between the tables along with a reference to the table name where the records were found.

In our specific examples, tables A and B were having a difference in two records; The first one seems to be the record for Andrew White since this person has a different value for no_of_purchases field. Additionally, table tableB has one additional record that is not even present on table tableA.

+------------+------------+-----------+-----------+-----------------+
| table_name | first_name | last_name | is_active | no_of_purchases |
+------------+------------+-----------+-----------+-----------------+
| tableB     | John       | Down      | false     | 0               |
| tableB     | Andrew     | White     | true      | 6               |
| tableA     | Andrew     | White     | true      | 4               |
+------------+------------+-----------+-----------+-----------------+

Note: If you are not familiar with the WITH clause and Common Table Expressions (CTEs) in SQL, make sure to read the following article:

What are CTEs in SQL


Comparing records of tables with different columns

Now let's suppose you would like to compare the records between two tables having a different amount of columns. Obviously, we would have to do an apples-to-apples comparison meaning that we somehow need to extract only the common fields from both tables in order to be able to perform a meaningful comparison.

Let's re-create our tables in order to generate some mis-matching columns so that we can then demonstrate how to deal with these cases:

-- Create the first table
CREATE TABLE `temp.tableA` (
  `first_name` STRING,
  `last_name` STRING,
  `is_active` BOOL,
  `dob` STRING
)
INSERT `temp.tableA` (first_name, last_name, is_active, dob)
VALUES 
  ('Bob', 'Anderson', True, '12/02/1993'),
  ('Maria', 'Brown', False, '10/05/2000'),
  ('Andrew', 'White', True, '14/12/1997')

-- Create the second table
CREATE TABLE `temp.tableB` (
  `first_name` STRING,
  `last_name` STRING,
  `is_active` BOOL,
  `no_of_purchases` INT
)
INSERT `temp.tableB` (first_name, last_name, is_active, no_of_purchases)
VALUES 
  ('Bob', 'Anderson', True, 12),
  ('Maria', 'Brown', True, 0),
  ('Andrew', 'White', True, 6),
  ('John', 'Down', False, 0)

Now our new tables have only three columns in common, namely first_name, last_name and is_active.

SELECT * FROM `temp.tableA`;

+------------+-----------+-----------+--------------+
| first_name | last_name | is_active | dob          |
+------------+-----------+-----------+--------------+
| Bob        | Anderson  | true      | '12/02/1993' |
| Andrew     | White     | true      | '10/05/2000' |
| Maria      | Brown     | false     | '14/12/1997' |
+------------+-----------+-----------+--------------+
SELECT * FROM `temp.tableB`;

+------------+-----------+-----------+-----------------+
| first_name | last_name | is_active | no_of_purchases |
+------------+-----------+-----------+-----------------+
| Bob        | Anderson  | true      | 12              |
| Andrew     | White     | true      | 6               |
| Maria      | Brown     | false     | 0               |
| John       | Down      | false     | 0               |
+------------+-----------+-----------+-----------------+

Now if we attempt to run the query we executed in the previous section where the two tables were having the same columns, we will end up with this error:

Column 4 in EXCEPT DISTINCT has incompatible types: STRING, INT64 at [13:7]

This is absolutely normal given that our tables no longer have matching columns. We need to slightly amend our initial query such that the very first CTEs will only select the mutual columns for every table. Our query will look as below:

WITH
  table_a AS (
    SELECT 
      first_name,
      last_name,
      is_active
    FROM 
      `temp.tableA`
  ),
  table_b AS (
    SELECT 
      first_name,
      last_name,
      is_active 
    FROM 
      `temp.tableB`
  ),
  rows_mismatched AS (
    SELECT
      'tableA' AS table_name,
      *
    FROM (
      SELECT
        *
      FROM
        table_a EXCEPT DISTINCT
      SELECT
        *
      FROM
        table_b 
    )

    UNION ALL

    SELECT
      'tableB' AS table_name,
      *
    FROM (
      SELECT
        *
      FROM
        table_b EXCEPT DISTINCT
      SELECT
        *
      FROM
        table_a 
    )
  )

SELECT * FROM rows_mismatched

The tables created in this section were having the following mismatches (when considering only their mutual columns):

  • The record for Maria Brown has differences in column is_active
  • Table tableB has one additional record (John Down) which is not present in tableA

These differences can be observed in query results shared below:

+------------+------------+-----------+-----------+
| table_name | first_name | last_name | is_active |
+------------+------------+-----------+-----------+
| tableB     | Maria      | Brown     | false     |
| tableB     | John       | Down      | false     | 
| tableA     | Maria      | Brown     | true      | 
+------------+------------+-----------+-----------+

Final Thoughts

In this article, we provided a comprehensive guide on how to compare tables in BigQuery. We highlighted the importance of this task in ensuring the accuracy and consistency of data and demonstrated multiple techniques for comparing tables with identical columns as well as tables with different amounts of columns. We also walked through the process of extracting records that differ between tables (if any).

Overall, this article aimed to equip readers with the necessary tools and knowledge to effectively and efficiently compare tables in BigQuery. I hope you found it useful!


Become a member and read every story on Medium. Your membership fee directly supports me and other writers you read. You'll also get full access to every story on Medium.

Join Medium with my referral link – Giorgos Myrianthous


Related articles you may also like

ETL vs ELT: What's the Difference?


What is dbt (data build tool)

Tags: Data Engineering Data Science Google Cloud Platform Programming Sql

Comment