How I Dockerized Apache Flink, Kafka, and PostgreSQL for Real-Time Data Streaming

Why Read This?
- Real-World Insights: Get practical tips from my personal journey of overcoming integration hurdles.
- Complete Setup: Learn how to integrate Flink, Kafka, and PostgreSQL seamlessly using Docker-Compose.
- Step-by-Step Guide: Perfect for both beginners and experienced developers looking to streamline their data streaming stack.
Setting Up the Scene
I embarked on a mission to integrate Apache Flink with Kafka and PostgreSQL using Docker. What makes this endeavor particularly exciting is the use of pyFlink – the Python flavor of Flink – which is both powerful and relatively rare. This setup aims to handle real-time data processing and storage efficiently. In the following sections, I'll demonstrate how I achieved this, discussing the challenges encountered and how I overcame them. I'll conclude with a step-by-step guide so you can build and experiment with this streaming pipeline yourself.
The infrastructure we'll build is illustrated below. Externally, there's a publisher module that simulates IoT sensor messages, similar to what was discussed in a previous post. Inside the Docker container, we will create two Kafka topics. The first topic, sensors, will store incoming messages from IoT devices in real-time. A Flink application will then consume messages from this topic, filter those with temperatures above 30°C, and publish them to a second topic, alerts. Additionally, the Flink application will insert the consumed messages into a PostgreSQL table created specifically for this purpose. This setup allows us to persist sensor data in a structured, tabular format, providing opportunities for further transformation and analysis. Visualization tools like Tableau or Power BI can be connected to this data for real-time plotting and dashboards.
Moreover, the alerts topic can be consumed by other clients to initiate actions based on the messages it holds, such as activating air conditioning systems or triggering fire safety protocols.
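To make the data flow concrete, here is a minimal sketch of the consume-filter-publish step using pyFlink's DataStream API. It is not the repo's actual job (which also writes to PostgreSQL, as covered below); the broker address, consumer group id, and the assumption that messages are JSON strings with a temperature field are mine:

import json
from pyflink.common import WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaSource, KafkaSink, KafkaOffsetsInitializer, KafkaRecordSerializationSchema
)

env = StreamExecutionEnvironment.get_execution_environment()

# Read raw JSON strings from the sensors topic (broker address is an assumption)
source = KafkaSource.builder() \
    .set_bootstrap_servers("kafka:19092") \
    .set_topics("sensors") \
    .set_group_id("flink-sensors") \
    .set_starting_offsets(KafkaOffsetsInitializer.latest()) \
    .set_value_only_deserializer(SimpleStringSchema()) \
    .build()

# Publish the filtered readings to the alerts topic
sink = KafkaSink.builder() \
    .set_bootstrap_servers("kafka:19092") \
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
        .set_topic("alerts")
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    ) \
    .build()

sensors = env.from_source(source, WatermarkStrategy.no_watermarks(), "sensors")
# Forward only readings above 30°C
sensors.filter(lambda raw: json.loads(raw)["temperature"] > 30).sink_to(sink)
env.execute("temperature_alerts")

The Kafka connector jars must be on Flink's classpath for this to run, which is exactly what the custom Docker image described later takes care of.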

To follow along with the tutorial, you can clone the following repo. A docker-compose.yml file sits at the root of the project so you can spin up the multi-container application. You'll also find detailed instructions in the README file.
Issues With Kafka Ports in docker-compose.yml
Initially, I ran into problems with Kafka's port configuration when using the confluentinc Kafka Docker image, a popular choice for this kind of setup. The issue only became apparent through the logs, which is why I recommend not running docker-compose up in detached mode (-d) during the initial setup and troubleshooting phases.
The failure happened because the internal and external listeners were configured on the same port, which led to connectivity problems. I fixed this by moving the internal listener to port 19092. I found this blog post quite clarifying.
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:19092,PLAINTEXT_HOST://localhost:9092
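For context, here's a trimmed sketch of the listener-related settings in the kafka service. The image tag and host port mapping are assumptions, and the rest of the broker configuration (Zookeeper/KRaft settings and so on) is omitted; only the listener wiring matters here:

kafka:
  image: confluentinc/cp-kafka
  ports:
    - "9092:9092"  # only the external listener is exposed to the host
  environment:
    # internal listener for other containers, external listener for the host machine
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
    KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:19092,PLAINTEXT_HOST://localhost:9092
    KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT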
Configuring Flink in Session Mode
To run Flink in session mode (allowing multiple jobs in a single cluster), I'm using the following directives in the docker-compose.yml.
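In short, that means separate jobmanager and taskmanager services sharing one image, with FLINK_PROPERTIES pointing the task manager at the job manager. A minimal sketch, assuming the custom pyflink image described in the next section and two task slots:

jobmanager:
  build: ./pyflink          # custom pyFlink image, see next section
  command: jobmanager
  ports:
    - "8081:8081"           # Flink web UI
  environment:
    - |
      FLINK_PROPERTIES=
      jobmanager.rpc.address: jobmanager
taskmanager:
  build: ./pyflink
  command: taskmanager
  depends_on:
    - jobmanager
  environment:
    - |
      FLINK_PROPERTIES=
      jobmanager.rpc.address: jobmanager
      taskmanager.numberOfTaskSlots: 2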
Custom Docker Image for PyFlink
Given the limitations of the default Apache Flink Docker image, which doesn't include Python support, I created a custom Docker image for pyFlink. This custom image ensures that Flink can run Python jobs and includes the necessary dependencies for integration with Kafka and PostgreSQL. The Dockerfile used for this is located in the pyflink subdirectory.
- Base Image: We start with the official Flink image.
- Python Installation: Python and pip are installed, and pip is upgraded to the latest version.
- Dependency Management: Python dependencies are installed from requirements.txt. Commented-out lines show the alternative of installing dependencies manually from local files, which is useful when deploying to environments without internet access.
- Connector Libraries: Connectors for Kafka and PostgreSQL are downloaded directly into the Flink lib directory. This enables Flink to interact with Kafka and PostgreSQL during job execution.
- Script Copying: Scripts from the repository are copied into the /opt/flink directory to be executed by the Flink task manager.
With this custom Docker image, we ensure pyFlink can run properly within the Docker container, equipped with the necessary libraries to interact with Kafka and PostgreSQL seamlessly. This approach provides flexibility and is suitable for both development and production environments.
Note: Ensure that any network or security considerations for downloading connectors and other dependencies are addressed according to your deployment environment's policies.
Integrating PostgreSQL
To connect Apache Flink to the PostgreSQL database, a proper JDBC connector is required. The custom Docker image for pyFlink downloads the JDBC connector for PostgreSQL, which is compatible with PostgreSQL 16.
To simplify this process, a download_libs.sh script is included in the repository, mirroring the actions performed in the Flink Docker container. This script automates the download of the necessary libraries, ensuring consistency between the Docker and local environments.
Note: Connector releases carry two version numbers. In this particular case, since I'm using Flink 1.18, the latest stable release available at the time of writing, I've downloaded 3.1.2-1.18. As far as I can tell, the first number is the connector's own release (the JDBC connector covers several databases), while the second indicates the Flink version it targets. The jars are available in the Maven repository, and they're registered in the pyFlink job like this:
# env is the StreamExecutionEnvironment; the jars sit next to the job script
env.add_jars(
    f"file://{current_dir}/flink-connector-jdbc-3.1.2-1.18.jar",
    f"file://{current_dir}/postgresql-42.7.3.jar"
)
Defining JDBC Sink
In our Flink task, there's a crucial function named configure_postgre_sink, located in the usr_jobs/postgres_sink.py file. It configures a generic PostgreSQL sink: you provide the SQL Data Manipulation Language (DML) statement and the corresponding value types. The value types used in the streaming data are defined as TYPE_INFO; admittedly, it took me a while to come up with the correct declaration.
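I won't claim this is the exact code from the repo, but a generic JDBC sink in pyFlink follows the pattern below. The table name, column names, and connection credentials are placeholders, and TYPE_INFO must line up, one to one and in order, with the placeholders in the DML statement:

from pyflink.common import Types
from pyflink.datastream.connectors.jdbc import (
    JdbcSink, JdbcConnectionOptions, JdbcExecutionOptions
)

# Row layout must match the INSERT placeholders, in order
TYPE_INFO = Types.ROW([Types.STRING(), Types.DOUBLE(), Types.STRING()])

jdbc_sink = JdbcSink.sink(
    "INSERT INTO raw_sensors_data (sensor_id, temperature, ts) VALUES (?, ?, ?)",
    TYPE_INFO,
    JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
    .with_url("jdbc:postgresql://postgres:5432/flinkdb")  # host/db name are placeholders
    .with_driver_name("org.postgresql.Driver")
    .with_user_name("flinkuser")
    .with_password("flinkpassword")
    .build(),
    JdbcExecutionOptions.builder()
    .with_batch_interval_ms(1000)
    .with_batch_size(200)
    .with_max_retries(5)
    .build()
)

# The sensor stream, mapped into Types.ROW records, is then attached with add_sink:
# sensor_rows.add_sink(jdbc_sink)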