I haven't used Delta Lake's Change Data Feed (CDF) yet and I want to understand whether it could be relevant for us.
We have the following set-up:
- Raw Data (update events from DynamoDB) ends up in a Staging Area ->
- We clean the new data and append it to a Bronze Table ->
- We merge the appended data into a Silver Table, to represent the latest state ->
- We run SQL queries on top of the Silver Tables, joining and aggregating them, thus creating our Gold Tables
Currently we track new data by using streaming checkpoints. This is very efficient in the Bronze -> Silver stage, since the Bronze table is append-only.
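For context, this is roughly what our Bronze -> Silver step looks like today with a streaming checkpoint (a minimal sketch assuming an existing `spark` session; the table names, checkpoint path, and merge key `id` are placeholders, not our real setup):

```python
from delta.tables import DeltaTable

# Stream the append-only Bronze table; the checkpoint tracks which files
# have already been processed, so only newly appended data is read.
bronze_stream = (
    spark.readStream
    .format("delta")
    .table("bronze_events")  # placeholder table name
)

def upsert_to_silver(microbatch_df, batch_id):
    # Merge each micro-batch into Silver so it reflects the latest state per key.
    silver = DeltaTable.forName(spark, "silver_state")  # placeholder table name
    (
        silver.alias("s")
        .merge(microbatch_df.alias("b"), "s.id = b.id")  # placeholder merge key
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    bronze_stream.writeStream
    .foreachBatch(upsert_to_silver)
    .option("checkpointLocation", "/checkpoints/bronze_to_silver")  # placeholder path
    .start()
)
```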
To my understanding, CDF could improve the performance of our Silver -> Gold jobs: with streaming checkpoints you still have to read the whole Parquet file even if only one row changed, whereas with CDF you only read the table changes. Is that correct?
Also, is there a reason to use CDF instead of streaming checkpoints in the Bronze -> Silver jobs?
CodePudding user response:
> To my understanding, CDF could improve the performance of our Silver -> Gold jobs: with streaming checkpoints you still have to read the whole Parquet file even if only one row changed, whereas with CDF you only read the table changes. Is that correct?
Yes, this is correct in principle. The nuance is that you mentioned your Gold tables are built from joins and aggregates, so you may still need all of that unchanged data anyway, depending on the type of aggregates you compute and whether there is referential integrity you need to maintain.
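If you do go the CDF route for Silver -> Gold, reading only the row-level changes looks roughly like this (a sketch assuming CDF is already enabled on the Silver table; the table name is a placeholder):

```python
# Stream only the row-level changes from Silver instead of re-reading full
# data files; each change row carries _change_type, _commit_version and
# _commit_timestamp columns describing what happened.
silver_changes = (
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .table("silver_state")  # placeholder table name
)

# Merge updates produce both update_preimage and update_postimage rows;
# when rebuilding downstream aggregates you typically keep only inserts
# and post-images.
latest_changes = silver_changes.filter(
    "_change_type IN ('insert', 'update_postimage')"
)
```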
> Also, is there a reason to use CDF instead of streaming checkpoints in the Bronze -> Silver jobs?
So long as this stage is append-only, no. In fact, if you were to enable CDF on this table, Delta would not write any separate change data files for plain appends, so you would just be reading the same data files you read now, but with some additional metadata (change type, commit version, commit timestamp) attached, which might be useful in its own right.
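For completeness, enabling CDF is just a table property, and the extra metadata shows up as columns when you read the change feed (a sketch; the table name and starting version are placeholders):

```python
# Enabling CDF is a table property; on an append-only table this does not
# create separate change files, it only exposes existing appends through
# the change feed with the extra metadata columns.
spark.sql("""
    ALTER TABLE bronze_events
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Batch read of the changes since a given version; on an append-only table
# _change_type will simply be 'insert' for every row.
changes = (
    spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)  # placeholder starting version
    .table("bronze_events")
)
```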