Multi-regional source of truth

I wrote a post a year ago about a migration from Postgres to Snowflake, and here is another migration story. This time I'm going to concentrate on the business case, architecture, and design rather than the technical details, but I'll also try to share some tips that might be helpful.
The business case
It's a B2B (Business to Business) company, an AaaS (Application as a Service) provider. Its software product is available as an online service to customers in multiple countries across the globe, and it needs a BI (Business Intelligence) system mostly to give those customers access to the data. Here are the most interesting requirements:
- Some customers need access to ready-to-use dashboards and visualizations, and some want to pull the data into their DWH.
- Since the customers operate in different countries, the data should stay in the region it was originally created in.
- The customers should see only their own information. Within a region, the data is stored in the same database and the same tables, so row-level security must be applied.
- In the majority of cases, the customers want real-time data.
Let me start with what the existing architecture looked like:

The key characteristics are read-only replicas to pull the data from, no DWH, and Looker on top. The obvious advantage of the solution is that it's the quickest one to set up. A replica, in general, has real-time data, Looker takes care of transformations, the semantic layer, and visualizations, and row-level security is also implemented in Looker using different filters. Unfortunately, quick setup is probably the only advantage, and the solution has a lot of drawbacks.
- Very limited ability to apply modern data engineering techniques. We can't persist transformed data, and we can't use proper tools to do transformations, implement pipelines, auto-tests, etc. Basically, the only available instrument is Looker's Derived Tables functionality.
- As there is no DWH, it's impossible to pull data from other sources like a CRM, mix data from different services, or upload any static dictionaries. The BI is limited to the data that one particular software product writes to its database.
- As I've mentioned earlier, some customers need to pull the data themselves, and they'd like it to be as raw and low-level as possible. Looker does a great job delivering reports, but it works on top of its semantic layer, which already contains joins and doesn't expose all the fields, and each Looker explore is designed for a specific business process. That structure is not optimized for export.
- Performance. This is probably the biggest problem: the OLTP database is not designed for analytical workloads, and we have no control over indexes since we work with read-only replicas. That leads not only to slow performance from the start but also to performance degradation over time, because without proper indexes the queries get slower as the data grows.
The solution
The most straightforward way to address those drawbacks was to introduce a DWH layer, which we did. The new design looked like this:

Streaming data
There are various providers on the market for streaming the data. We've chosen Fivetran, but one can also have a look at Stitch, Airbyte, or an open-source framework like Meltano or poor Singer. We've also considered Google Storage Transfer Service, but it was still quite raw and didn't have enough flexibility.
The whole process of the streaming setup wasn't without issues, of course. Here are some of them:
- The OLTP database didn't have public access, so a VPN had to be set up for Fivetran to pull the data. It involved some network configuration and required additional back and forth to decide on a solution that worked for both Fivetran and us.
- Fivetran adds to the database load, so we had to set up additional monitoring and make sure that the stream doesn't affect the main functionality.
- The initial plan was to sync data every 5 minutes, but the big volume of changes was causing connections to time out and errors to appear in the logs. Fivetran was able to recover, so there was no data loss, but the noise made it difficult to distinguish real errors from these overload errors. There was no straightforward solution other than relaxing the real-time requirement and pulling the data less frequently.
Transforming data
BigQuery was chosen mainly because it satisfies all the requirements we had, and the majority of the existing infrastructure is also in Google Cloud, so the transition was supposed to be smooth.
Since we now have a DWH which is not read-only, we were able to use DBT to transform the data and mix data from different sources. Even without any transformations, with Looker querying the same data as before but from BigQuery, the performance improved dramatically, sometimes more than 100 times for the heavy tables.
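To give an idea of what mixing sources in DBT looks like, here is a minimal sketch of a model. The source, table, and column names (product_db, crm, events, accounts) are assumptions for illustration, not the actual project code:

```sql
-- models/customer_activity.sql (hypothetical names)
-- Joins data replicated from the product database with CRM data,
-- something that wasn't possible with read-only replicas and Looker alone.
{{ config(materialized='table') }}

select
    e.customer_id,
    c.account_manager,
    count(*)        as events,
    max(e.event_ts) as last_seen_at
from {{ source('product_db', 'events') }} as e
left join {{ source('crm', 'accounts') }} as c
    on e.customer_id = c.customer_id
group by 1, 2
```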
I won't dwell on the data preparation for Looker as it's very domain-specific. What's more interesting is how we solved the data export problem. In general, it's very simple: we've just given customers direct access to BigQuery. But the devil is in the details, so let me share some of them.
The easiest requirement to implement was keeping data in different regions. Fortunately, the region of each dataset in BigQuery is configurable. Fivetran streams data into different datasets, and the DBT models are run against these datasets separately, so the data from different regions doesn't get mixed at all.
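As a minimal sketch, this is how a dataset can be pinned to a location when created by hand; the project and dataset names are made up, and in practice Fivetran and DBT create most of these datasets for you:

```sql
-- Each region gets its own dataset with an explicit location,
-- so the data physically stays in that region.
CREATE SCHEMA IF NOT EXISTS `my-project.fivetran_eu`
OPTIONS (location = 'EU');

CREATE SCHEMA IF NOT EXISTS `my-project.fivetran_us`
OPTIONS (location = 'US');
```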
It was a bit more complicated to implement row-level security and make sure that the customers have access to their data only. BigQuery supports row-level security natively, but since we use DBT, we've chosen a different approach to have better control over the exposed data and more options to automate the process.
We've created a set of DBT models that apply security in SQL code based on the information about the currently connected user and create views in BigQuery. More precisely:
- Each customer who needs access is provided with a user/service account. The IAM role grants general access to BigQuery but no access to any specific dataset, so, initially, a new user can't query any data.
- The new user's email is added to the DBT seed files along with the identifier of the customer. Files for different regions are kept separately. This gives us a user-customer mapping as well as information on which dataset the user should have access to. The seed files are stored in GitHub, and therefore we can run a GitHub Action on every change to these files.
- The GitHub Action runs the parametrized DBT project against the different regions. The regions and the data are different, but the structure is the same, so we needed to write the SQL code just once per view. DBT uploads the seed files, creates the views, and, as a final step, gives individual users read-only access to the needed dataset.
- The views identify the currently logged-in user's email using the session_user() function, join it with the user-customer mapping, and filter out everything that doesn't belong to that customer (see the sketch below).
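Here is a minimal sketch of such a view as a DBT model; the model, seed, and column names are hypothetical and only illustrate the pattern described above:

```sql
-- models/exports/orders.sql (hypothetical names)
-- user_customer_mapping is the uploaded DBT seed: user_email, customer_id
{{ config(materialized='view') }}

select o.*
from {{ ref('stg_orders') }} as o
join {{ ref('user_customer_mapping') }} as m
    on o.customer_id = m.customer_id
-- session_user() returns the email of the account running the query,
-- so each customer sees only their own rows
where m.user_email = session_user()
```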
One very specific case I wanted to share is giving access to OAuth users. In Google IAM there are two types of users: the service account and the OAuth user. It's easy to work with the first one, but not all downstream systems support this type of access; e.g., Tableau Online can't work with a service account. The second type is the way a normal user logs into a Google account. It's very poorly documented how to make sure that such a user has access to specific datasets only, so here are the tips:
- A user should have all the permissions of the BigQuery Job User role and, additionally, the bigquery.jobs.create permission.
- The dataset-level permission can then be granted the same way it works for a service account, e.g. with the SQL GRANT command (see the example after this list).
- A user doesn't necessarily need to be created under the same Google account. E.g., people from other organizations can get access provided their IAM permissions are configured correctly in our project.
- Be careful with group emails. Dataset-level access works fine with them, but session_user() returns the user's individual email, so the row-level security won't work with a group email.
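For illustration, a dataset-level grant for an individual OAuth user might look like this; the project, dataset, and email are made up, and the project-level BigQuery Job User role is assigned through IAM rather than SQL:

```sql
-- Give one customer's OAuth user read access to their regional export dataset.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `my-project.eu_exports`
TO "user:analyst@customer-company.com";
```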
The last tip to mention is that since the clients have access to the views, not the underlying tables, the views should be able to query the data from other datasets without exposing the raw data to the client. BigQuery has such an ability: one can authorize all the views from one dataset to query the data from another dataset regardless of the end user's permissions.
BigQuery monitoring
Since the customers have direct access to the DWH, it's important to know how much each of them actually uses it. Each customer is tied to a user, so user-level monitoring can answer this question. I've come across an interesting article on how to do it, and we've just re-implemented the same approach in Looker.
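The article itself isn't reproduced here, but the core idea can be sketched as a query over BigQuery's jobs metadata; the region qualifier and the 30-day window below are arbitrary choices for the example:

```sql
-- Per-user BigQuery usage over the last 30 days.
SELECT
  user_email,
  COUNT(*)                               AS queries,
  SUM(total_bytes_billed) / POW(10, 12)  AS tb_billed,
  SUM(total_slot_ms) / 1000 / 3600       AS slot_hours
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY tb_billed DESC;
```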
The other part of the infrastructure is GitHub, more precisely GitHub Actions. This functionality allows us to create an execution workflow and put it on a schedule. There are existing open-source actions like this that help implement it, and the only thing to pay for is the GitHub minutes.
Single source of truth
The last thing I'd like to mention is the solution's flexibility. I've described the main and most interesting case, but the company has other services, and with the current solution we were able to move the analytics of all these services into our DWH and use it as the source of truth for the downstream systems. In other words, it also serves as a data feed. DBT is used to produce the views and tables for all the business cases, and the dataset-level access allows us to separate the downstream systems and users at the data level.
DBT's flexibility allows creating models for different downstream systems within one project, and the whole process is automated by means of GitHub Actions, as I mentioned earlier.