The backbone of any data lake is cheap data storage and processing using an object store and schema-on-read data tables. This allows for cost-effective data analytics using Apache Spark, Athena, and other data lake tools.
The mistake that some companies make is trying to use one data store for all use cases. That is an unachievable strategy: you cannot get cheap storage and analytics and fast random seeks from the same data store. Many projects have tried, and some get very close, but there is always a sacrifice.
Here are a few examples:
The bottom line is that you cannot violate the laws of physics, so accept that fact and use a datastore for each use case; if that means you end up with three different datastores, so be it.
Using s3 + Glue Tables is the backbone that covers most use cases, especially for big data analytics shops. We simply run Apache Spark atop s3 + Glue to produce insights, models, and analytics, and output the results to DynamoDB, Redshift, an RDBMS, etc.
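A minimal PySpark sketch of this pattern, assuming a Glue-catalogued table and a JDBC-reachable serving database (all table, path, and connection names here are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("s3-glue-analytics")
    # On EMR/Glue, the Hive metastore is typically wired to the Glue Data Catalog.
    .enableHiveSupport()
    .getOrCreate()
)

# Read a schema-on-read table registered in the Glue Data Catalog, backed by s3.
orders = spark.table("sales.orders")

# Compute a simple aggregate (daily revenue per region).
daily_revenue = (
    orders
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("revenue"))
)

# Push the result out to a serving store; here, an RDBMS over JDBC.
(
    daily_revenue.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://analytics-db:5432/reporting")
    .option("dbtable", "public.daily_revenue")
    .option("user", "reporter")
    .option("password", "***")
    .mode("overwrite")
    .save()
)
```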
The difficulty that pushes some teams away from s3 is that s3 data is immutable (objects cannot be updated in place). This leads to cumbersome ETL strategies built on full joins, multiple copies of datasets, and rewriting/overwriting data, not to mention keeping track of partitions.
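A rough sketch of that "join and rewrite everything" pattern, to show why it gets painful; the paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("full-rewrite-etl").getOrCreate()

existing = spark.read.parquet("s3://my-lake/customers/")        # current snapshot
updates = spark.read.parquet("s3://my-lake/customer_updates/")  # latest changes

# Full outer join on the business key, preferring the updated row when present.
merged = (
    existing.alias("e")
    .join(
        updates.alias("u"),
        F.col("e.customer_id") == F.col("u.customer_id"),
        "full_outer",
    )
    .select(
        F.coalesce(F.col("u.customer_id"), F.col("e.customer_id")).alias("customer_id"),
        F.coalesce(F.col("u.name"), F.col("e.name")).alias("name"),
        F.coalesce(F.col("u.updated_at"), F.col("e.updated_at")).alias("updated_at"),
    )
)

# Because s3 objects cannot be updated in place, the entire dataset is rewritten,
# and partition bookkeeping is left to the pipeline author.
merged.write.mode("overwrite").parquet("s3://my-lake/customers_v2/")
```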
The Apache Hudi project seeks to bridge this gap by providing reliable upserts to s3 data, which simplifies data pipelines and removes the barrier to keeping data in s3 up to date. There are several architectures we can build with the Hudi framework, and many configurations within each architecture. It can be difficult to select the proper architecture and then implement pipelines that reliably and accurately update data.
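As a point of reference, here is a minimal sketch of a Hudi upsert into an s3 dataset, assuming Spark is launched with the Hudi bundle on its classpath; the table name, record key, precombine field, and paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

updates = spark.read.parquet("s3://my-lake/customer_updates/")

hudi_options = {
    "hoodie.table.name": "customers",
    "hoodie.datasource.write.recordkey.field": "customer_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.operation": "upsert",
}

# Hudi merges these records into the existing table files on s3
# instead of forcing a full rewrite of the dataset.
(
    updates.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-lake/hudi/customers/")
)
```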
In this article we implement a data pipeline that keeps two datasets in sync using Hudi's upsert capabilities.