Why ETL-Zero? Understanding the Shift in Data Integration


When I was preparing for the Salesforce Data Cloud certification, I came across the term Zero-ETL. The Data Cloud makes it possible to access data directly from other systems such as data warehouses or data lakes, or to share data with these systems, without the data being copied. Salesforce also describes this as Bring Your Own Lake (BYOL), a nod to the term Bring Your Own Device (BYOD). I wanted to better understand the concept of Zero-ETL and illustrate it in an understandable way.

In this article, I'll show you how to build a simplified ETL process with Python to better understand this concept, what Zero-ETL (or Zero-Copy) means and how this new approach to data integration is implemented in the Salesforce Data Cloud.

Table of Contents
1) Traditional ETL Process: Step-by-Step Guide with Python for Beginners
2) So what is Zero-ETL?
3) Why Zero-ETL? Advantages and Disadvantages
4) What does Zero-ETL look like in the Salesforce Data Cloud?
5) Final Thoughts

1) Traditional ETL Process: Step-by-Step Guide with Python for Beginners

If you are already familiar with the ETL and ELT processes, you can skip this section. If you are new to the topic, take a look at this super-simplified example to better understand the Extract – Transform – Load process. Or, even better, build it yourself – by applying the concepts, you will usually understand them better.

1 – Extract

In a traditional ETL data processing pipeline, the data is collected from a source such as a database, an API, a JSON file, an XML file or another data warehouse.

For our example, we will first create a CSV file containing customer data. I have put together a file with sample data that contains the columns 'First Name', 'Last Name', 'Email', 'Purchased_Product' and 'Price_Paid'. You can find the CSV file and the code on GitHub.
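If you prefer to build the file yourself instead of downloading it, a few made-up rows are enough. The following is a minimal sketch – the values are invented, only the column names match the example file:

import pandas as pd

# Hypothetical sample rows - only the column names match the file on GitHub
sample = pd.DataFrame({
    'First Name': ['Anna', 'Ben', 'Clara'],
    'Last Name': ['Smith', 'Miller', 'Jones'],
    'Email': ['anna@example.com', 'ben@example.com', 'clara@example.com'],
    'Purchased_Product': ['Book', 'Headphones', 'Mug'],
    'Price_Paid': [12.99, 49.90, 7.50],
})

# Write the sample data to a CSV file that the extract step can read
sample.to_csv('customers.csv', index=False)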

We then read the CSV file with pandas and display the first 5 lines:

Python">import pandas as pd
import sqlite3
import matplotlib.pyplot as plt

# Step 1: Extract
# Reading data from the csv file
file_path = 'YOURPATH' # On Windows, separate your path with / (or escape the backslashes)
data = pd.read_csv(file_path)
print("Extracted data:")
print(data.head())

If you need help setting up a Python environment, it is best to read the steps in the article 'Python Data Analysis Ecosystem – A Beginner's Roadmap'. I work with Anaconda and JupyterLab for projects like this. For the code to work, you need pandas and matplotlib installed (sqlite3 is part of the Python standard library). If you are using Anaconda, you can enter the command 'conda install pandas matplotlib' in your Anaconda Prompt.
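As a quick sanity check before running the ETL steps, you can confirm that the libraries are available (sqlite3 ships with Python itself, so it only needs to be importable):

import sqlite3            # part of the Python standard library
import pandas as pd
import matplotlib

print("pandas:", pd.__version__)
print("matplotlib:", matplotlib.__version__)
print("SQLite engine:", sqlite3.sqlite_version)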

2 – Transform

As soon as the data has been extracted, data transformations follow in the traditional ETL process. This can mean that column values are combined, calculations are performed, tables are merged or unnecessary information is removed.

For our example, we will carry out two simple transformations in this step. First, we create a new column that stores the full name, based on the first name and last name. Then we want to distinguish the customers who have spent a high amount from those who have spent a lower amount. To do this, we create a second new column that contains 'Yes' for all rows with an amount over 20 and 'No' otherwise.

# Step 2: Transform
# Creating a new column 'Full Name' by combining 'First Name' and 'Last Name'
data['Full Name'] = data['First Name'] + ' ' + data['Last Name']

# Creating a new column High_Payment with 'Yes' if Price_Paid > 20, otherwise 'No'
data['High_Payment'] = data['Price_Paid'].apply(lambda x: 'Yes' if x > 20 else 'No')

We display the first 5 rows again to check whether these transformations have been carried out successfully (two new columns: 'Full Name' and 'High_Payment'):

# Displaying the 5 first rows
print("Transformed data:")
print(data.head())

3 – Load

After transformation, the traditional ETL process involves loading the data into a platform for further analyses. For example, machine learning methods can be applied to the data or the data can be visualised for dashboards and reports.

For our example, we load the transformed data into an SQLite database in this step. SQLite is MySQL's little sister, so to speak, and is well-suited for simple projects with small to medium-sized data volumes. Here we also carry out a small analysis on the data and visualise it.

# Step 3: Load 
# Connecting to the SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('output_database.db')

# Loading the DataFrame into a new table in the SQLite database
data.to_sql('transformed_data', conn, if_exists='replace', index=False)

# Analysis: Counting how many customers made high payments
high_payment_count = (data['High_Payment'] == 'Yes').sum()
print("Number of High Payments:", high_payment_count)

# Close the database connection
conn.close()

print("ETL process completed. Transformed data saved to 'output_database.db'.")

# Visualizing the data 
data['Price_Paid'].hist(bins=10)
plt.title('Distribution of Prices Paid')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
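To convince yourself that the load step worked, you can read the table back from the SQLite file. This is a minimal check, assuming the 'output_database.db' file and the 'transformed_data' table from above exist:

import sqlite3
import pandas as pd

# Re-open the database created in the load step and read the table back
conn = sqlite3.connect('output_database.db')
reloaded = pd.read_sql_query('SELECT * FROM transformed_data', conn)
conn.close()

print(reloaded.head())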

As you can see, the example is very simplified. Of course, much larger amounts of data are extracted in real projects, the transformations are usually much more complex and the data is typically loaded into systems such as other databases, data warehouses, data lakes or data visualisation tools.

So, what are some challenges of this traditional ETL process?

With this process, the data is not available in real time but is usually processed and copied in batches. Furthermore, the process consumes additional resources and therefore incurs higher costs. This is where the term Zero-ETL comes into play.

2) So what is Zero-ETL?

We live in an age of the instant. Every message, every movie, every song must be available immediately, at any time – thanks, of course, to the success of WhatsApp, Netflix and Spotify, to name just a few examples.

Cloud providers such as Amazon Web Services, Google Cloud and Microsoft Azure have made exactly this their goal: data should be processed and analysed almost in real time, without major delays.

Zero-ETL is a data integration concept. Instead of requiring data to be explicitly extracted, transformed and loaded in separate steps, as is traditionally the case, data should flow seamlessly between different systems. The term was introduced by AWS in 2022 for the integration of Amazon Aurora with Amazon Redshift.

What is new about this concept is that the technology makes it possible to use or analyse data directly in its original format and almost in real-time. There is no need to move data. Data latency is minimised. Data can be transformed and analysed within a single platform.
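To make this more tangible with the tools from our example: the sketch below is only an analogy (it is not how AWS or Salesforce implement Zero-ETL). Instead of copying the table into a second database, an 'analytics' database simply attaches the source database created in step 3 and queries the data where it lives:

import sqlite3

# A separate "analytics" database that holds no copy of the customer data
conn = sqlite3.connect('analytics.db')

# Attach the source database from the load step instead of copying its table
conn.execute("ATTACH DATABASE 'output_database.db' AS source")

# Query the data directly in place - no extract, no load, no duplicate
cursor = conn.execute(
    "SELECT COUNT(*) FROM source.transformed_data WHERE High_Payment = 'Yes'"
)
print("High payments (queried in place):", cursor.fetchone()[0])

conn.close()

The data stays in 'output_database.db'; the analytics side only holds a reference to it – which is the core idea behind Zero-Copy.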

Imagine that traditional ETL is like collecting water in a bucket outside your house and then carrying it to your shower. Not only does this process take time and energy, but you may also spill water on the way. (Okay, that's how I showered when I spent 4 months in Tanzania 10 years ago.)
