Upserting S3 Data Using Apache Hudi with PySpark

Apache Hudi is a framework for managing the storage of large datasets on object stores such as Amazon S3. In this article, we explore how to use Apache Hudi with PySpark to perform upserts (combined update and insert operations) on S3 data. This functionality is particularly useful in data engineering and data science workflows that require frequent updates to large datasets.

What are Upserts and Why are They Important?

Upserts, as the name suggests, are a blend of 'update' and 'insert' operations. In the context of database management and data processing, an upsert operation updates existing records if they exist, and inserts new records if they do not. This operation is crucial in scenarios where data needs to be frequently updated without duplicating entries. For instance, upserts are essential in maintaining the latest state of data in real-time applications or when working with data streams that continuously generate new data while also updating existing records.
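To make the semantics concrete, here is a minimal plain-Python sketch (independent of Hudi or Spark, with made-up records) of what an upsert keyed on id does:

# Existing table state, keyed by 'id'
table = {1: {'id': 1, 'value': 'a'}, 2: {'id': 2, 'value': 'b'}}

# Incoming batch: id 2 already exists (update), id 3 is new (insert)
incoming = [{'id': 2, 'value': 'B'}, {'id': 3, 'value': 'c'}]

for row in incoming:
    table[row['id']] = row  # a single operation covers both cases

# table now maps 1 -> original, 2 -> updated, 3 -> newly inserted

A Hudi upsert applies this same key-based merge at scale, distributed across files on S3.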

How Does Apache Hudi Enable Upserts?

Apache Hudi brings stream processing to big data, providing fresh data while also efficiently handling updates. It allows for near real-time ingestion and updates, making it a perfect choice for use cases that require fast data retrieval and modifications. Hudi achieves this by maintaining a timeline of commits, which tracks the changes over time, enabling efficient upserts and even deletes.
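That commit timeline is directly queryable. As a brief sketch, assuming an already-initialized Spark session (set up in the next section) and an existing Hudi table, an incremental query returns only the records that changed after a given commit instant (the instant time below is illustrative):

# Read only records committed after the given instant on the Hudi timeline
incremental_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000") \
    .load("s3://your-bucket-name/your-hudi-table/")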

Upserting Data on S3 Using Apache Hudi and PySpark

Now, let’s dive into the technical process of upserting S3 data using Apache Hudi and PySpark.

Setting Up the Environment

First, ensure that PySpark is installed in your environment:

pip install pyspark

Note that Apache Hudi itself is not distributed on PyPI; it ships as a Spark bundle JAR that you attach when starting Spark.
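For an interactive shell, you can attach the bundle with the --packages flag. The coordinates below are illustrative; match the bundle to your Spark and Scala versions:

pyspark --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1

The same coordinates can alternatively be supplied through the spark.jars.packages config when building the Spark session.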

Next, import SparkSession in your PySpark script:

from pyspark.sql import SparkSession

Initializing Spark Session with Hudi

Initialize the Spark session with Hudi support. Hudi requires the Kryo serializer, and registering Hudi's Spark session extension enables its SQL and write paths:

spark = SparkSession.builder \
    .appName("Hudi Upsert Example") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
    .getOrCreate()

Writing Data to S3 with Upsert Functionality

Assuming you have a DataFrame df that you wish to write to S3 with upsert capabilities, the next step is to configure Hudi and call the writer.
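If you need a concrete df to follow along, here is a minimal sketch; the column names (id, partitionpath, timestamp) line up with the Hudi options defined below, and the values themselves are made up:

from pyspark.sql import Row

df = spark.createDataFrame([
    Row(id=1, partitionpath="2024-01-01", value="a", timestamp=1),
    Row(id=2, partitionpath="2024-01-01", value="b", timestamp=2),
])

With the DataFrame in place, define the Hudi options and write: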

# Define Hudi options
hudiOptions = {
    'hoodie.table.name': 'my_hudi_table',
    # Field that uniquely identifies each record
    'hoodie.datasource.write.recordkey.field': 'id',
    # Field used to lay out partitions on S3
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    # When two rows share a key, the one with the larger value here wins
    'hoodie.datasource.write.precombine.field': 'timestamp',
    # Upsert is the default write operation, but being explicit is clearer
    'hoodie.datasource.write.operation': 'upsert',
    # Small parallelism values suit this toy example; raise them for real data
    'hoodie.upsert.shuffle.parallelism': '2',
    'hoodie.insert.shuffle.parallelism': '2'
}

# Writing data to S3
df.write.format("hudi") \
  .options(**hudiOptions) \
  .mode("append") \
  .save("s3://your-bucket-name/your-hudi-table/")

This writes the DataFrame to S3 as a Hudi table with upsert semantics. Note that mode("append") does not blindly append rows: combined with the upsert operation, Hudi matches incoming rows to existing records by the record key, updating records whose key already exists (using the precombine field to pick the latest version) and inserting the rest. On platforms other than Amazon EMR you will typically address the bucket with an s3a:// path and need the hadoop-aws dependency on the classpath.
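To verify the upsert, read the table back. A plain load performs a snapshot query, returning the latest committed value for each record key:

# Read the latest snapshot of the Hudi table from S3
result = spark.read.format("hudi").load("s3://your-bucket-name/your-hudi-table/")
result.show()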

Conclusion

Apache Hudi simplifies handling large datasets with frequent updates. Its ability to perform efficient upserts makes it an invaluable tool in modern data architectures, especially when combined with the power of PySpark. By following the steps outlined in this article, you can seamlessly integrate Apache Hudi into your data pipelines and efficiently manage data in S3.
