Integrating Apache Flink with Kafka and PostgreSQL Using Docker

05 Jul 2024

Integrating pyFlink, Kafka, and PostgreSQL using Docker offers a fascinating journey into the world of real-time data processing. This setup is not just about connecting different technologies but about ensuring they work together seamlessly to handle data efficiently. Here’s a detailed look at how this integration can be achieved, along with some practical insights and solutions to common issues.

Setting Up the Scene

The mission to integrate Apache Flink with Kafka and PostgreSQL using Docker is particularly exciting because it uses pyFlink — the Python flavor of Flink. This setup aims to handle real-time data processing and storage efficiently. The infrastructure includes a publisher module that simulates IoT sensor messages (sketched after the list below). Inside the Docker container, two Kafka topics are created:

  • sensors: Stores incoming messages from IoT devices in real-time.
  • alerts: Receives filtered messages with temperatures above 30°C.
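
To make the publisher side concrete, here is a minimal sketch using the kafka-python client. The broker address and the JSON field names (sensor_id, temperature) are illustrative assumptions, not the repository's actual code:

import json
import random
import time

from kafka import KafkaProducer

# Hypothetical publisher: emits one simulated IoT reading per second.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # the broker's external listener
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    producer.send("sensors", {
        "sensor_id": random.randint(1, 10),
        "temperature": round(random.uniform(20.0, 40.0), 2),
        "ts": time.time(),
    })
    time.sleep(1)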

A Flink application consumes messages from the sensors topic, filters those with temperatures above 30°C, and publishes them to the alerts topic. Additionally, the Flink application inserts the consumed messages into a PostgreSQL table, allowing for structured data storage and further analysis. Visualization tools like Tableau or Power BI can connect to this data for real-time plotting and dashboards. The alerts topic can also be consumed by other clients to initiate actions based on the messages it holds, such as activating air conditioning systems or triggering fire safety protocols.
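
A minimal pyFlink sketch of that job might look as follows. It assumes the messages are JSON with sensor_id and temperature fields, that the Kafka and JDBC connector JARs are already on the classpath (see the custom image section below), and that the PostgreSQL table and credentials are placeholders:

import json

from pyflink.common import Row, Types, WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.jdbc import (
    JdbcConnectionOptions, JdbcExecutionOptions, JdbcSink,
)
from pyflink.datastream.connectors.kafka import (
    KafkaOffsetsInitializer, KafkaRecordSerializationSchema,
    KafkaSink, KafkaSource,
)

env = StreamExecutionEnvironment.get_execution_environment()

# Read raw JSON strings from the 'sensors' topic via the internal listener.
source = (KafkaSource.builder()
          .set_bootstrap_servers("kafka:19092")
          .set_topics("sensors")
          .set_group_id("sensor-pipeline")
          .set_starting_offsets(KafkaOffsetsInitializer.earliest())
          .set_value_only_deserializer(SimpleStringSchema())
          .build())
sensors = env.from_source(source, WatermarkStrategy.no_watermarks(), "sensors")

# Forward readings above 30°C to the 'alerts' topic.
alerts = sensors.filter(lambda raw: json.loads(raw)["temperature"] > 30.0)
alert_sink = (KafkaSink.builder()
              .set_bootstrap_servers("kafka:19092")
              .set_record_serializer(
                  KafkaRecordSerializationSchema.builder()
                  .set_topic("alerts")
                  .set_value_serialization_schema(SimpleStringSchema())
                  .build())
              .build())
alerts.sink_to(alert_sink)

# Persist every consumed message into PostgreSQL for later analysis.
def to_row(raw):
    msg = json.loads(raw)
    return Row(msg["sensor_id"], msg["temperature"])

row_type = Types.ROW([Types.INT(), Types.DOUBLE()])
sensors.map(to_row, output_type=row_type).add_sink(JdbcSink.sink(
    "INSERT INTO sensor_readings (sensor_id, temperature) VALUES (?, ?)",
    row_type,
    JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
    .with_url("jdbc:postgresql://postgres:5432/sensors_db")  # placeholder DSN
    .with_driver_name("org.postgresql.Driver")
    .with_user_name("postgres")      # placeholder credentials
    .with_password("postgres")
    .build(),
    JdbcExecutionOptions.builder().with_batch_interval_ms(1000).build()))

env.execute("sensors-to-alerts-and-postgres")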

Issues With Kafka Ports in docker-compose.yml

Initially, I ran into problems with Kafka’s port configuration when using the confluentinc Kafka Docker image. The issue only became apparent through the logs, which underlines the importance of not running docker-compose up in detached mode (-d) during initial setup and troubleshooting. The failure was caused by the internal and external listeners advertising the same port, which led to connectivity problems. I resolved it by moving the internal listener to port 19092:

KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:19092,PLAINTEXT_HOST://localhost:9092

This adjustment ensured that Kafka could communicate effectively within the Docker environment.
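
For context, the relevant fragment of the Kafka service ends up looking roughly like this (the image tag is illustrative, and ZooKeeper/KRaft settings are omitted for brevity):

kafka:
  image: confluentinc/cp-kafka:7.4.0
  ports:
    - "9092:9092"   # host clients keep using localhost:9092
  environment:
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
    KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:19092,PLAINTEXT_HOST://localhost:9092
    KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT

With the listeners split this way, containers on the Docker network reach the broker at kafka:19092, while tools on the host continue to use localhost:9092.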

Configuring Flink in Session Mode

Running Flink in session mode allows multiple jobs to share a single long-running cluster. In docker-compose.yml this translates into two services built from the same image, a JobManager and a TaskManager, rather than a single container:

jobmanager:
  image: custom-flink-image
  command: jobmanager
  ports:
    - "8081:8081"
  environment:
    - JOB_MANAGER_RPC_ADDRESS=jobmanager

taskmanager:
  image: custom-flink-image
  command: taskmanager
  environment:
    - JOB_MANAGER_RPC_ADDRESS=jobmanager
    - TASK_MANAGER_NUMBER_OF_TASK_SLOTS=2
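
Once the cluster is up, a pyFlink job can be submitted to the session cluster through the JobManager container, assuming the job script is mounted or baked into the image at a path like /opt/src/job.py:

docker compose exec jobmanager ./bin/flink run -py /opt/src/job.py

The web UI on localhost:8081 then shows the running job alongside any others submitted to the same session.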

Custom Docker Image for PyFlink

The default Flink Docker image ships without a Python runtime, so it cannot run pyFlink applications out of the box. To address this, I created a custom Docker image for pyFlink that bundles all the necessary dependencies and configurations: a Python interpreter, the apache-flink package, and the required connector JARs. The custom image ensures that all components of the data streaming stack are compatible and run smoothly together.
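
As a rough sketch, such an image can extend the official Flink base image. The versions below are assumptions and should be pinned to match your cluster:

FROM flink:1.18.1

# Add a Python runtime and pyFlink (versions are illustrative).
RUN apt-get update -y && \
    apt-get install -y python3 python3-pip && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    pip3 install apache-flink==1.18.1

# Ship the Kafka and JDBC/PostgreSQL connector JARs with the image so jobs
# can talk to Kafka and PostgreSQL without extra submission flags.
COPY lib/*.jar /opt/flink/lib/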

By following these steps, you can build and experiment with this streaming pipeline yourself. For a complete setup, clone the provided repository and refer to the detailed instructions in the README file. This guide is perfect for both beginners and experienced developers looking to streamline their data streaming stack.
