I have implemented slowly changing dimensions on DWHs in batch many times, whereby a set of more than one change for a given business key can be processed. No sweat.
Using the following:
- Spark non-Structured Streaming Programs
- ETL tools
- PL/SQL
However, as far as I can see, this is not possible using Structured Streaming, due to the multi-step nature of the processing and the limitations that prevail in Spark Structured Streaming.
Or is it possible? If so, please advise whether there is a method.
CodePudding user response:
Yes, it's possible, but you need some code to implement it. From your updates dataframe you need to create a union of two sets of rows (see the sketch after this list):
- the updates themselves, which carry the full merge key - they will match the existing current rows, for which you'll set current = false and end_date = date_of_new_record
- the result of an inner join with the target table, but with the merge key(s) set to NULL, so they will not match and will be inserted as new rows with current = true and end_date = null
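A minimal DataFrame-API sketch of that union (assuming the update batch is a DataFrame called updates_df, the target table is customers_df, and the columns are customerId, address, current and end_date - all names here are illustrative, not taken from the question):

from pyspark.sql import functions as F

def build_staged_updates(updates_df, customers_df):
    # Rows carrying the real merge key: they will MATCH the current row of an
    # existing customer, which the MERGE will then close
    # (current = false, end_date = date of the new record).
    rows_to_close = updates_df.select(F.col("customerId").alias("mergeKey"), "*")

    # Rows with a NULL merge key: they will NOT MATCH and will be inserted
    # as the new current version (current = true, end_date = null).
    rows_to_insert = (
        updates_df.alias("u")
        .join(customers_df.alias("c"),
              F.col("u.customerId") == F.col("c.customerId"))
        .where((F.col("c.current") == F.lit(True)) &
               (F.col("u.address") != F.col("c.address")))
        .select(
            # Cast the NULL to the same type as customerId so the union lines up.
            F.lit(None).cast(updates_df.schema["customerId"].dataType).alias("mergeKey"),
            "u.*",
        )
    )

    return rows_to_close.unionByName(rows_to_insert)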
The SQL below is from the official Delta Lake documentation (and the accompanying notebook):
-- These rows will either UPDATE the current addresses of existing
-- customers or INSERT the new addresses of new customers
SELECT updates.customerId as mergeKey, updates.* FROM updates
UNION ALL
-- These rows will INSERT new addresses of existing customers
-- Setting the mergeKey to NULL forces these rows
-- to NOT MATCH and be INSERTed.
SELECT NULL as mergeKey, updates.*
FROM updates JOIN customers
ON updates.customerId = customers.customerId
WHERE customers.current = true
AND updates.address <> customers.address
This resulting dataframe is then used as the source of a MERGE statement that is called from .foreachBatch.
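As a rough sketch of that wiring (assuming a streaming DataFrame updates_stream, a Delta target table named customers with columns customerId, address, current, effective_date and end_date, and the build_staged_updates helper from the sketch above - the table name, checkpoint path and column names are assumptions for illustration):

def upsert_scd2(micro_batch_df, batch_id):
    # Use the micro-batch's own session so the temp view registered below is the
    # one the MERGE reads.
    spark = micro_batch_df.sparkSession
    customers_df = spark.table("customers")

    staged_updates = build_staged_updates(micro_batch_df, customers_df)
    staged_updates.createOrReplaceTempView("staged_updates")

    # SCD Type 2 MERGE in the same shape as the documentation example:
    # matched current rows are closed, rows with a NULL mergeKey fall through
    # to the INSERT branch as the new current version.
    spark.sql("""
        MERGE INTO customers
        USING staged_updates
        ON customers.customerId = staged_updates.mergeKey
        WHEN MATCHED AND customers.current = true
                     AND customers.address <> staged_updates.address THEN
          UPDATE SET current = false, end_date = staged_updates.effective_date
        WHEN NOT MATCHED THEN
          INSERT (customerId, address, current, effective_date, end_date)
          VALUES (staged_updates.customerId, staged_updates.address,
                  true, staged_updates.effective_date, null)
    """)

(updates_stream.writeStream
    .foreachBatch(upsert_scd2)
    .option("checkpointLocation", "/tmp/checkpoints/customers_scd2")
    .outputMode("update")
    .start())

The MERGE runs once per micro-batch, so each batch of changes is folded into the dimension table transactionally.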