Synchronizing Kafka Topics with S3 using Apache Hudi DeltaStreamer

Integrating real-time data streams into a scalable and queryable data lake is a common challenge in modern data architectures. Apache Hudi's DeltaStreamer tool offers a powerful and efficient solution for this, particularly for synchronizing data from Apache Kafka topics directly into an Amazon S3 data lake. In this article, we will explore how to use DeltaStreamer for this purpose, highlighting its ease of use and configuration-driven, no-code approach.

Understanding Apache Hudi DeltaStreamer

Apache Hudi's DeltaStreamer is a utility designed to ingest and process data from a variety of sources, including Kafka, and write it into Hudi tables. It ships as part of the Hudi utilities bundle and runs as an Apache Spark job, with the entire pipeline driven by configuration rather than custom code. This no-code approach makes it an ideal choice for data engineers and architects looking to streamline their data ingestion pipelines.

Setting Up DeltaStreamer for Kafka to S3 Synchronization

To synchronize data from a Kafka topic to an S3 bucket using DeltaStreamer, follow these steps:

Prerequisites

Ensure you have Apache Spark and Apache Kafka set up in your environment. DeltaStreamer is not a standalone package; it ships inside the Hudi utilities bundle jar, which you can download from Maven Central. Pick the build that matches your Spark and Scala versions, for example:

# Hudi utilities bundle (contains DeltaStreamer); match the Scala suffix to your Spark build
org.apache.hudi:hudi-utilities-bundle_2.12:<hudi-version>

Also, make sure your Kafka brokers are reachable and that the Spark environment running DeltaStreamer has access to S3 (for example, through appropriately configured AWS credentials and the S3 filesystem connector).
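
As a quick sanity check, assuming the standard Kafka command-line tools and the AWS CLI are installed (the broker address, topic, and bucket names below are placeholders), you can confirm that both endpoints are reachable:

# Confirm the broker is reachable and the topic exists
kafka-topics.sh --bootstrap-server your-kafka-broker:9092 --list

# Confirm the target bucket is accessible with your AWS credentials
aws s3 ls s3://your-bucket-name/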

Configuring DeltaStreamer

DeltaStreamer reads most of its configuration from a properties file passed via the --props option. Here's an example configuration for Kafka to S3 synchronization (topic, broker, bucket, and field names are placeholders):

# Kafka source options (topic plus standard Kafka consumer settings)
hoodie.deltastreamer.source.kafka.topic=your_kafka_topic
hoodie.deltastreamer.kafka.source.maxEvents=10000
bootstrap.servers=your-kafka-broker:9092
group.id=hudi_delta_streamer
auto.offset.reset=earliest

# Schema files for the FilebasedSchemaProvider passed on the command line below
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://your-bucket-name/schemas/source.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://your-bucket-name/schemas/target.avsc

# General Hudi write options; the S3 base path and table type are supplied
# on the DeltaStreamer command line rather than in this file
hoodie.datasource.write.recordkey.field=your_record_key_field
hoodie.datasource.write.partitionpath.field=your_partition_field
hoodie.upsert.shuffle.parallelism=2
hoodie.insert.shuffle.parallelism=2
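
The FilebasedSchemaProvider expects Avro schema files describing the incoming JSON records and the target table. A minimal sketch of such a schema file, with purely illustrative record and field names, might look like:

{
  "type": "record",
  "name": "YourRecord",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "ts", "type": "long"},
    {"name": "payload", "type": ["null", "string"], "default": null}
  ]
}

If the source and target schemas are identical, the same file can be referenced by both schema properties.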

Running DeltaStreamer

Once configured, you launch DeltaStreamer as a Spark job with spark-submit, pointing it at the Hudi utilities bundle jar. Here's an example:

spark-submit \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
    /path/to/hudi-utilities-bundle_2.12-<hudi-version>.jar \
    --table-type COPY_ON_WRITE \
    --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
    --source-ordering-field ts \
    --target-base-path s3://your-bucket-name/your-hudi-table \
    --target-table your_hudi_table \
    --props file://path/to/your/properties/file \
    --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

This command starts a Spark job running DeltaStreamer, which reads new records from your Kafka topic and writes them to the specified S3 path as a Hudi table. The --source-ordering-field must name a field present in your records (ts in the illustrative schema above); Hudi uses it to decide which version of a record wins when updates or duplicates arrive. By default DeltaStreamer processes one batch of new data and then exits; add the --continuous flag to keep it running and sync the topic continuously.
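
To confirm that records are landing in the table, you can read it back with Spark. Below is a minimal PySpark sketch, assuming the session was started with a Hudi Spark bundle on the classpath (outside EMR you may also need the hadoop-aws package and the s3a:// path scheme):

from pyspark.sql import SparkSession

# Assumes pyspark/spark-submit was launched with a Hudi Spark bundle, e.g.
#   --packages org.apache.hudi:hudi-spark3-bundle_2.12:<hudi-version>
#   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
spark = SparkSession.builder.getOrCreate()

# Load the Hudi table written by DeltaStreamer and inspect a few records
df = spark.read.format("hudi").load("s3://your-bucket-name/your-hudi-table")
df.show(10, truncate=False)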

Conclusion

Apache Hudi's DeltaStreamer simplifies the task of synchronizing real-time data streams with a data lake. Its ability to directly integrate Kafka topics into an S3-based Hudi dataset, without the need for extensive coding, makes it an invaluable tool in modern data pipelines. By following the steps outlined in this article, you can efficiently set up a robust pipeline that keeps your data lake synchronized with your real-time data sources.
