ETL vs ELT vs Streaming ETL


Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) are two fundamental concepts in the context of data processing, used to describe data ingestion and transformation design paradigms. While these terms are often used interchangeably, they refer to distinct concepts, apply to different use cases and lead to different pipeline designs.

In this article, we will explore the differences and similarities of ETL and ELT and discuss how the cloud computing and Data Engineering landscape has affected data processing design patterns. Furthermore, we will outline the main advantages and disadvantages each offers to modern data teams. Lastly, we will discuss Streaming ETL, an emerging data-processing pattern that aims to solve various disadvantages of more traditional batch approaches.




The three steps of interest

Ingesting and persisting data from external sources into a destination system involves three distinct steps.

Extract

The ‘Extract' step involves all processes required to pull data from a source system. Such sources include Application Programming Interfaces (APIs), database systems, files and Internet of Things (IoT) devices, while the data itself can be in any form: structured, semi-structured or unstructured. Data pulled during this step is usually referred to as ‘raw data'.

Pulling raw data from a source system as part of the ‘Extract' step in ETL/ELT pipelines – Source: Author

Transform

During the ‘Transform' step, the pipeline applies transformations on top of the raw data in order to achieve a certain goal, usually driven by business or technical requirements. Commonly applied transformations include data modification (e.g. mapping United States to US), record or attribute selection, joins with other data sources, or even data validations.

Applying transformation on raw data to achieve a certain goal as part of the ‘Transform' step in ETL/ELT pipelines – Source: Author
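
As a concrete illustration, here is a minimal sketch of such a transformation step in Python. The field names, the country mapping and the filtering rule are hypothetical and only meant to show the kinds of operations this step performs.

```python
# A minimal, hypothetical 'Transform' step: normalise country names,
# keep only completed orders and select the attributes of interest.

COUNTRY_MAPPING = {"United States": "US", "United Kingdom": "UK"}

def transform(raw_records: list[dict]) -> list[dict]:
    transformed = []
    for record in raw_records:
        if record.get("status") != "completed":   # record selection
            continue
        transformed.append({                      # attribute selection
            "order_id": record["order_id"],
            "amount": record["amount"],
            # data modification: map full country names to short codes
            "country": COUNTRY_MAPPING.get(record["country"], record["country"]),
        })
    return transformed
```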

Load

During the ‘Load' step, the data (either raw or transformed) is loaded into a destination system. Usually, the destination is an OLAP system (e.g. a Data Warehouse or Data Lake), but sometimes it can even be an OLTP system, such as an application database.

Loading data into a destination system as part of the ‘Load' step in ETL/ELT pipelines – Source: Author

Evidently, both the ETL and ELT paradigms involve the aforementioned steps. Their key difference lies in the order in which they execute them, which in turn has significant implications for cost, computing power and design patterns. Therefore, the most appropriate paradigm should be selected based on the advantages and disadvantages each has to offer. It is also worth mentioning that team structure itself can be a good indicator of which paradigm should be chosen. Let's get into the details to make this even clearer!


Extract Transform Load (ETL)

ETL refers to the process that involves extraction from the source system (or file), followed by the transformation step that modifies the extracted raw data and finally the loading step that ingests the transformed data into the destination system.

The sequence of execution in Extract-Transform-Load (ETL) pipelines – Source: Author

ETL Steps:

  1. Extract raw data from the source system
  2. Transform the data in a staging server (i.e. external to the destination system/server)
  3. Load data into the destination system

ETL has been around for many years and gained significant traction in the days when companies started using Data Warehouses to store their data in a centralised place within their technological ecosystem.

ETL is still relevant in the data world, but when it comes to batch processing it demonstrates two significant disadvantages. The first is that the raw data never makes it to the destination system. This means that whenever a new transformation is required, a new ETL job must be executed in order to fetch the data from the source (once again!), apply the new transformation rules and finally load it into the destination system. The second major disadvantage is that the transformation usually occurs on a staging server, which has to be provisioned, maintained and paid for, and which can become a bottleneck as data volumes grow.
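
To make the order of execution concrete, here is a minimal ETL sketch in Python. The API endpoint, table names and connection string are illustrative assumptions rather than references to any specific tool.

```python
# A minimal ETL sketch: extract from a (hypothetical) REST API, transform on a
# staging machine, and only then load the result into the destination warehouse.
import requests
import sqlalchemy

def extract() -> list[dict]:
    # Extract: pull raw data from the source system
    response = requests.get("https://api.example.com/orders")
    response.raise_for_status()
    return response.json()

def transform(raw: list[dict]) -> list[dict]:
    # Transform: runs on the staging server, *before* anything reaches the warehouse
    return [
        {"order_id": r["order_id"], "amount": r["amount"], "country": r["country"].upper()}
        for r in raw
        if r.get("status") == "completed"
    ]

def load(rows: list[dict]) -> None:
    # Load: only the transformed data is written; the raw data never lands here
    engine = sqlalchemy.create_engine("postgresql://user:pass@warehouse:5432/analytics")
    with engine.begin() as conn:
        conn.execute(
            sqlalchemy.text(
                "INSERT INTO orders_clean (order_id, amount, country) "
                "VALUES (:order_id, :amount, :country)"
            ),
            rows,
        )

if __name__ == "__main__":
    load(transform(extract()))
```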


Extract Load Transform (ELT)

With the emergence of cloud technologies and the significant reduction of data storage costs, modern organisations are now able to store enormous amounts of raw data and then apply any transformations or analyses, as required.

These changes in the technological landscape have also shifted the way data teams ingest and analyse data. The ELT paradigm enables teams to load the data into the destination system first, where any transformations can then be performed.

The sequence of execution in Extract-Load-Transform (ELT) pipelines – Source: Author

ELT Steps:

  1. Extract raw data from the source system
  2. Load extracted raw data into the destination system
  3. Transform raw data within the destination system

With ELT, there is no need to maintain (and pay for) a staging server where the transformations would normally take place. Additionally, the raw data is available in the destination system at any given time, which means further transformations can be performed whenever they are needed, without re-extracting anything from the source.
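
The contrast with the ETL sketch above lies in where the transformation runs: here the raw payload lands in the warehouse first, and the transformation is plain SQL executed inside it. The sketch below assumes a Postgres-style warehouse with a JSONB `payload` column; the endpoint, table names and connection details are hypothetical.

```python
# A minimal ELT sketch: raw data is loaded into the warehouse as-is, and the
# transformation is expressed as SQL executed *inside* the warehouse afterwards.
import json
import requests
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@warehouse:5432/analytics")

def extract_and_load_raw() -> None:
    # Extract + Load: persist the raw payload untouched; no staging server involved
    raw = requests.get("https://api.example.com/orders").json()
    with engine.begin() as conn:
        conn.execute(
            sqlalchemy.text("INSERT INTO orders_raw (payload) VALUES (:payload)"),
            [{"payload": json.dumps(r)} for r in raw],
        )

def transform_in_warehouse() -> None:
    # Transform: runs inside the destination system and can be re-run at any time,
    # because the raw data is still there
    with engine.begin() as conn:
        conn.execute(sqlalchemy.text("""
            INSERT INTO orders_clean (order_id, amount, country)
            SELECT payload->>'order_id',
                   (payload->>'amount')::numeric,
                   UPPER(payload->>'country')
            FROM orders_raw
            WHERE payload->>'status' = 'completed'
        """))

if __name__ == "__main__":
    extract_and_load_raw()
    transform_in_warehouse()
```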


Streaming ETL

Streaming ETL refers to a concept where the ETL paradigm is performed continuously. As the data comes in (as part of the ‘Extract' step), the transformations – including joining, mapping, filtering, and windowing – are performed over streaming data. Processed data are then pushed into the downstream destination system, such as a database, a warehouse or even a message queue.

While both the ETL and ELT paradigms are still quite relevant and useful, they typically run as batch processes. This means that latency is increased while data freshness is decreased. Streaming ETL, on the other hand, offers numerous advantages over batch jobs, including lower latency, increased data freshness, more flexibility (both in terms of resource usage and the data processing itself) and better scalability.

As an example, let's consider fraud detection, which is applicable to many different industries and especially to financial services. In banking, fraud can be committed using stolen cards, forgery or even phishing calls. Detecting fraudulent activity is time-critical, since financial institutions need to perform certain actions (e.g. freezing an account or sending a notification) upon detection. Although many factors should be taken into account when detecting malicious transactional activity, an obvious and key signal is the same account performing transactions within a few minutes of each other but from geographical locations whose distance cannot plausibly be covered in the time between those transactions. To detect this, we could take advantage of time windows.
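
As a rough illustration of the windowing idea (written here over an in-memory list rather than a real stream), the sketch below buckets each account's transactions into five-minute tumbling windows and flags an account when two transactions in the same window are implausibly far apart. The data layout, window size and distance threshold are all hypothetical.

```python
# Bucket transactions per account into 5-minute tumbling windows and flag the
# account when two transactions in the same window are too far apart.
from collections import defaultdict
from itertools import combinations
from math import radians, sin, cos, asin, sqrt

WINDOW_SECONDS = 5 * 60
MAX_PLAUSIBLE_KM = 50  # distance plausibly covered within a single window

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two coordinates, in kilometres
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def flag_suspicious(transactions):
    """transactions: iterable of dicts with account, ts (epoch seconds), lat, lon."""
    windows = defaultdict(list)
    for tx in transactions:
        key = (tx["account"], tx["ts"] // WINDOW_SECONDS)  # tumbling-window bucket
        windows[key].append(tx)

    flagged = set()
    for (account, _), txs in windows.items():
        for a, b in combinations(txs, 2):
            if haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) > MAX_PLAUSIBLE_KM:
                flagged.add(account)
                break
    return flagged
```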

Evidently, batch processing, whether executed in the form of ETL or ELT pipelines, would be incapable of serving time-critical use cases such as fraud detection. And this is exactly where stream processing comes into play.


Stream Processing

Stream processing refers to the practice of continuously processing data at the time it is created. Typically, a stream processing pipeline involves the ingestion or generation of raw data, the processing performed over it (aggregations, filtering, windowing, enrichment and so on) and, eventually, the delivery of the processed stream into a destination system.

High-level overview of a stream processing pipeline – Source: Author
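
The defining characteristic is that state is updated incrementally as each event arrives, instead of re-scanning a stored batch. Below is a minimal sketch of that idea in plain Python; the event generator stands in for a real message queue such as Kafka, and the sensor payloads are made up.

```python
# Events are handled one by one as they arrive, and the running state is updated
# incrementally rather than recomputed over a stored batch.
import random
import time
from collections import defaultdict

def event_stream():
    # Stand-in for consuming from Kafka/Pulsar/Kinesis: yields events forever
    while True:
        yield {"sensor": random.choice(["a", "b"]), "value": random.uniform(0, 100)}
        time.sleep(0.1)

def process(stream, max_events=50):
    counts = defaultdict(int)
    sums = defaultdict(float)
    for i, event in enumerate(stream):
        # Incremental aggregation: the running average is updated per event
        counts[event["sensor"]] += 1
        sums[event["sensor"]] += event["value"]
        running_avg = sums[event["sensor"]] / counts[event["sensor"]]
        print(f"sensor={event['sensor']} running_avg={running_avg:.2f}")
        if i + 1 >= max_events:  # keep the sketch finite
            break

if __name__ == "__main__":
    process(event_stream())
```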

Batch processing was the status quo for a long time, but the rapid increase in data generation across the globe, along with the emergence of technologies that challenge batch architectures, has changed the landscape.

Evidently, the primary advantage of stream processing is its ability to process data in real time. This enables users and organisations to perform faster analyses and, therefore, faster decision-making.

Furthermore, stream processing systems can scale much more easily than batch processes, since they are designed to handle huge volumes of data in real time. This is especially important for use cases with spiky data flows (in terms of volume), such as Internet-of-Things applications.

In contrast to batch jobs, stream processing can also reduce costs related to data storage. Streaming jobs process only the data that actually needs to be processed, without having to store and maintain large datasets full of unused records.


Applying Streaming ETL with Streaming Databases

There are different flavours of Streaming Databases, providing different features and supporting different SQL dialects. One such streaming database is RisingWave, which I will use as an example to describe some Streaming ETL concepts, since its SQL dialect is Postgres-compatible and therefore familiar to me.

RisingWave is a distributed SQL database for stream processing and it is designed to enable a seamless experience when it comes to building real-time and event-driven applications.

The tool can consume data from sources such as Apache Pulsar, Apache Kafka, Kinesis and CDC streams. In turn, data from RisingWave can be delivered to downstream systems, including data warehouses, data lakes and message brokers/queues.

The platform can be used to consume data from the aforementioned sources, perform incremental computations as new data comes in and update results dynamically. Because RisingWave is a database system itself, users can also access the results efficiently and effectively.

Evidently, one of the many use cases RisingWave supports is Streaming ETL. Data can be extracted from various sources and ingested into the streaming database, where transformations can be applied. The resulting processed/transformed data is then written to the desired destination system on a continuous basis.

Streaming architecture with RisingWave – Source: RisingWave GitHub repository (Licensed under Apache License)

Streaming ETL steps using RisingWave

  1. Extract: RisingWave enables ingestion of data from various sources, including log files, message queues, and CDC streams coming from database logs
  2. Transform: It provides the functionality required to perform transformations over streaming data, including mapping, aggregation, filtering, windowing, joining and stateful processing
  3. Load: Finally, the processed data streams can be loaded from RisingWave into the downstream destination system, as shown in the sketch below
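
Since RisingWave speaks the Postgres wire protocol, these three steps can be expressed as SQL statements submitted from any Postgres client. The sketch below uses psycopg2; the connector options, topic names, sink configuration and exact DDL syntax are illustrative assumptions, so consult the RisingWave documentation for the authoritative syntax of the version you are running.

```python
# A rough sketch of the three Streaming ETL steps as RisingWave SQL, submitted
# over the Postgres wire protocol. Connection details, connector options and
# topic/table names are illustrative assumptions, not verified configuration.
import psycopg2

DDL_STATEMENTS = [
    # 1. Extract: declare a source that continuously ingests events from Kafka
    """
    CREATE SOURCE IF NOT EXISTS transactions (
        account VARCHAR,
        amount DOUBLE PRECISION,
        ts TIMESTAMP
    ) WITH (
        connector = 'kafka',
        topic = 'transactions',
        properties.bootstrap.server = 'kafka:9092'
    ) FORMAT PLAIN ENCODE JSON;
    """,
    # 2. Transform: a materialized view keeps the aggregation incrementally up to date
    """
    CREATE MATERIALIZED VIEW IF NOT EXISTS account_totals AS
    SELECT account, COUNT(*) AS txn_count, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY account;
    """,
    # 3. Load: a sink continuously delivers the transformed stream downstream
    """
    CREATE SINK IF NOT EXISTS account_totals_sink FROM account_totals
    WITH (
        connector = 'jdbc',
        jdbc.url = 'jdbc:postgresql://warehouse:5432/analytics?user=rw&password=rw',
        table.name = 'account_totals',
        type = 'upsert',
        primary_key = 'account'
    );
    """,
]

with psycopg2.connect("host=localhost port=4566 dbname=dev user=root") as conn:
    conn.autocommit = True
    with conn.cursor() as cur:
        for ddl in DDL_STATEMENTS:
            cur.execute(ddl)
```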

Some use cases that can greatly benefit from these Streaming ETL capabilities include:

  • Continuous data integration
  • Internet-of-Things (IoT) data processing
  • Real-time fraud detection
  • Real-time alerting

An alternative solution for Streaming ETL is Flink SQL, which allows users to develop streaming applications using SQL. Databricks and Snowflake have also recently started providing support for streaming applications and could be good alternative options as well.


Final Thoughts…

Designing data processing pipelines is a challenging task, given that there are no one-size-fits-all solutions. Therefore, one should become familiar with the fundamental concepts in order to make informed decisions when implementing a solution to a specific and niche problem.

In today's article, we discussed two commonly used paradigms in the context of data processing, namely ETL and ELT. Furthermore, we discussed how modern organisations can take advantage of real-time capabilities and implement event-driven architectures with Streaming ETL using RisingWave, a distributed SQL database for stream processing that supports real-time analytics, event-driven applications and Streaming ETL, among other use cases.


