In an era where data flows continuously from countless sources, real-time data processing is essential for businesses to stay competitive.
Whether it’s processing financial transactions, monitoring IoT devices or providing personalized content to users, organizations need robust and scalable infrastructures to handle these high-velocity data streams.
Apache Kafka, Apache Flink and SingleStore are three powerful technologies that, when combined, form an efficient and reliable real-time data processing pipeline. Kafka serves as the backbone for distributed messaging, ensuring data streams are reliably ingested. Flink provides advanced capabilities for stateful stream processing, allowing for complex transformations and windowing on the fly. Finally, SingleStore acts as a high-performance, scalable relational database capable of storing and querying processed data in real time, ensuring low-latency responses to analytical queries.
In this blog, we’ll walk through building and deploying a real-time data pipeline that simulates data ingestion, processes it with Flink and stores the results in SingleStore, all orchestrated with Docker and Docker Compose, giving you a clear picture of how to handle real-time data at scale.
Apache Flink
Apache Flink is a powerful stream-processing framework designed for large-scale, real-time data analytics. Unlike traditional batch processing systems, Flink excels at handling continuous streams of data, enabling low-latency and high-throughput processing. Its event-driven architecture allows it to process each data event as it arrives, making it ideal for time-sensitive applications like real-time fraud detection, recommendation systems and IoT monitoring.
Flink also supports stateful computations, fault tolerance and complex event processing, making it versatile for both real-time and batch workloads. With built-in connectors to data sources like Kafka and support for various APIs, Flink is a go-to choice for building scalable, distributed stream-processing applications.
More formally, Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams, designed to run in all common cluster environments and to perform computations at in-memory speed at any scale.
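To make these ideas concrete, here is a minimal, hedged sketch of a keyed tumbling-window aggregation using Flink’s DataStream API. The in-memory source and the sensor IDs are illustrative stand-ins for a real stream such as a Kafka topic; the demo project’s actual processing logic is shown later in this post.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A bounded in-memory source stands in for a real stream such as a Kafka topic.
        env.fromElements(
                Tuple2.of("sensor-1", 1), Tuple2.of("sensor-2", 1), Tuple2.of("sensor-1", 1))
            // Group events by sensor id and count them in 10-second processing-time windows.
            .keyBy(t -> t.f0)
            .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
            .sum(1)
            .print();

        env.execute("windowed-count-sketch");
    }
}

Each incoming event updates the per-key window state as it arrives, which is exactly the event-driven, stateful behavior described above.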
SingleStore
Designed as a database to power intelligent applications, SingleStore can read, write and reason on petabyte-scale data in a few milliseconds.
Project overview
This demo project showcases how a real-time data pipeline can be set up using these technologies. The pipeline consists of two primary components:
- Kafka Producer. Simulates data and sends it to a Kafka topic.
- Flink processor. Reads the data from Kafka, processes it in real time and writes the results to a SingleStore database.
Project structure
The project is split into two Maven-based services, each containerized with Docker:
- Kafka Producer. Generates and sends data to Kafka.
- Flink processor. Consumes data from Kafka, processes it and stores the results in SingleStore.
Here’s what the folder structure looks like:
├── kafka-producer/        # Maven project for the Kafka Producer
│   └── Dockerfile
├── flink-processor/       # Maven project for the Flink Processor
│   └── Dockerfile
├── docker-compose.yml     # Docker Compose setup for Kafka, Zookeeper, producer and processor
└── README.md              # Project documentation
Step 1. Setting up Kafka and Zookeeper
In the real-time pipeline, Kafka plays a critical role by acting as the distributed messaging platform, ensuring that data is published and consumed in a reliable, scalable manner. To manage Kafka's metadata and maintain synchronization across the cluster, we use Zookeeper.
The Kafka and Zookeeper setup is managed via Docker Compose:
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 32181
    networks:
      bridge:

  kafka:
    image: confluentinc/cp-kafka
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:32181
    networks:
      bridge:
Step 2. Building the Kafka Producer
The Kafka Producer is responsible for generating simulation data and pushing it to a Kafka topic. This producer is a Maven project, and we’ll build it into a Docker image for ease of deployment.
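To illustrate, here is a minimal sketch of what the data-generation loop could look like. The topic name ("events"), the JSON payload and the PRODUCER_INTERVAL handling are illustrative assumptions; the actual producer in the repository may differ.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        // KAFKA_SERVER is the broker address injected by Docker Compose (see the Dockerfile CMD below).
        props.put("bootstrap.servers", System.getenv("KAFKA_SERVER"));
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // PRODUCER_INTERVAL (milliseconds) controls how often a simulated record is emitted.
        long intervalMs = Long.parseLong(System.getenv().getOrDefault("PRODUCER_INTERVAL", "1000"));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                // Hypothetical payload; the real project generates its own simulated records.
                String value = "{\"reading\": " + Math.random() + "}";
                producer.send(new ProducerRecord<>("events", value));
                Thread.sleep(intervalMs);
            }
        }
    }
}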
Dockerfile for Kafka Producer:
FROM openjdk:8u151-jdk-alpine3.7
RUN apk add --no-cache bash libc6-compat
WORKDIR /
COPY wait-for-it.sh wait-for-it.sh
COPY target/kafka2singlestore-1.0-SNAPSHOT-jar-with-dependencies.jar k-producer.jar
CMD ./wait-for-it.sh -s -t 30 $ZOOKEEPER_SERVER -- ./wait-for-it.sh -s -t 30 $KAFKA_SERVER -- java -Xmx512m -jar k-producer.jar
To build and package the Kafka Producer:
cd kafka-producer/
mvn clean package
docker build -t kproducer .
Step 3. Real-time stream processing with Apache Flink
Once data is in Kafka, we use Apache Flink to process it. The Flink processor reads from Kafka, applies transformations or windowing operations and writes the results to SingleStore using a JDBC sink.
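Before looking at the container image, here is a hedged sketch of the kind of job the processor could run. The topic name, consumer group and use of the legacy FlinkKafkaConsumer connector are assumptions for illustration; the SingleStore sink wiring is covered in Step 4.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class ProcessorSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        // KAFKA_SERVER is injected by Docker Compose, matching the Dockerfile CMD below.
        props.setProperty("bootstrap.servers", System.getenv("KAFKA_SERVER"));
        props.setProperty("group.id", "flink-processor");

        // Consume raw String messages from the (assumed) "events" topic.
        DataStream<String> stream = env.addSource(
            new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        // A trivial transformation stands in for the project's real parsing and windowing logic;
        // in the demo the results are handed to the JDBC sink shown in Step 4.
        stream.map(String::toUpperCase).print();

        env.execute("flink-processor-sketch");
    }
}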
Dockerfile for Flink processor:
FROM openjdk:8u151-jdk-alpine3.7
RUN apk add --no-cache bash libc6-compat
WORKDIR /
COPY wait-for-it.sh wait-for-it.sh
COPY target/flink-kafka2postgres-1.0-SNAPSHOT-jar-with-dependencies.jar flink-processor.jar
CMD ./wait-for-it.sh -s -t 30 $ZOOKEEPER_SERVER -- ./wait-for-it.sh -s -t 30 $KAFKA_SERVER -- java -Xmx512m -jar flink-processor.jar
To build and package the Flink processor:
cd flink-processor/
mvn clean package
docker build -t flink-processor .
Step 4. Storing data in SingleStore
The final step is to configure the Flink processor to store processed data into a SingleStore database using the JDBC sink. Here’s a snippet of the JDBC configuration for Flink:
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:singlestore://<<hostname>>:<<port>>/<<database>>")
.withDriverName("com.singlestore.jdbc.Driver")
.withUsername("<<username>>")
.withPassword("<<password>>")
.build();
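These connection options are only one argument to the sink. Here is a hedged sketch of how they could be wired into a complete JdbcSink, assuming a hypothetical events table with event_key and reading columns and a Tuple2<String, Double> record type; the real project defines its own schema and record class.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStream;

public class SingleStoreSinkSketch {
    // Attaches a JDBC sink that writes (key, reading) pairs into a hypothetical "events" table.
    static void addSingleStoreSink(DataStream<Tuple2<String, Double>> events) {
        events.addSink(JdbcSink.sink(
            "INSERT INTO events (event_key, reading) VALUES (?, ?)",
            (statement, event) -> {
                statement.setString(1, event.f0);
                statement.setDouble(2, event.f1);
            },
            JdbcExecutionOptions.builder()
                .withBatchSize(1000)       // buffer up to 1,000 rows per batch
                .withBatchIntervalMs(200)  // or flush every 200 ms, whichever comes first
                .withMaxRetries(3)
                .build(),
            new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl("jdbc:singlestore://<<hostname>>:<<port>>/<<database>>")
                .withDriverName("com.singlestore.jdbc.Driver")
                .withUsername("<<username>>")
                .withPassword("<<password>>")
                .build()));
    }
}

Batching the inserts keeps the sink from issuing one round trip per event, which matters once the stream reaches high throughput.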
Be sure to replace <<hostname>>, <<port>>, <<database>>, <<username>> and <<password>> with your actual SingleStore database credentials.
Step 5. Running everything with Docker Compose
Once the Kafka producer and Flink processor are built into Docker images, you can bring up the entire system using Docker Compose. This command will start Zookeeper, Kafka, the Kafka Producer and the Flink Processor:
docker-compose up
Step 6. Customization
- Data generation frequency. Modify the PRODUCER_INTERVAL environment variable in the docker-compose.yml file to control how frequently the Kafka producer generates data.
- Database credentials. Ensure the Flink processor is configured with the correct database details for SingleStore.
With this setup, you have a real-time pipeline that produces, processes and stores data seamlessly. Apache Flink ensures that data can be processed as soon as it arrives, and SingleStore provides the performance and scalability necessary to handle large-scale, real-time data workloads.
This project can be extended to handle more complex use cases including multi-topic Kafka pipelines, more advanced stream processing in Flink and integrating other data sinks. The full codebase is available here.
Feel free to fork the repository, and start building your real-time streaming pipeline with SingleStore today!