I have implemented slowly changing dimensions on DWHs in batch many times, whereby a set of more than one change for a given business key can be processed. No sweat.
Using the following:
- Spark non-Structured Streaming Programs
- ETL tools
- PL/SQL
However, as far as I can see, this is not possible using Structured Streaming, due to the multi-step nature of the processing and the limitations that prevail in Spark Structured Streaming.
Or is it possible? If so, please advise whether there is a method.
CodePudding user response:
Yes, it's possible, but you need some code to implement it. From your updates dataframe you need to create a union of two sets of rows (see the sketch after this list):
- the updates themselves, which carry the full merge key - they will match the existing current rows, for which you'll set current = false and end_date = date_of_new_record
- the result of an inner join with the target table, but with the merge key(s) set to NULL, so they will not match and will be inserted as new rows with current = true and end_date = null
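A minimal DataFrame-API sketch of that union (assuming the update batch is a DataFrame called updates_df, the target table is customers_df, and the columns are customerId, address, current and end_date - all names here are illustrative, not taken from the question):

from pyspark.sql import functions as F

def build_staged_updates(updates_df, customers_df):
    # Rows carrying the real merge key: they will MATCH the current row of an
    # existing customer, which the MERGE will then close
    # (current = false, end_date = date of the new record).
    rows_to_close = updates_df.select(F.col("customerId").alias("mergeKey"), "*")

    # Rows with a NULL merge key: they will NOT MATCH and will be inserted
    # as the new current version (current = true, end_date = null).
    rows_to_insert = (
        updates_df.alias("u")
        .join(customers_df.alias("c"),
              F.col("u.customerId") == F.col("c.customerId"))
        .where((F.col("c.current") == F.lit(True)) &
               (F.col("u.address") != F.col("c.address")))
        .select(
            # Cast the NULL to the same type as customerId so the union lines up.
            F.lit(None).cast(updates_df.schema["customerId"].dataType).alias("mergeKey"),
            "u.*",
        )
    )

    return rows_to_close.unionByName(rows_to_insert)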
The SQL below is from the official Delta Lake documentation (and the accompanying notebook):
-- These rows will either UPDATE the current addresses of existing
-- customers or INSERT the new addresses of new customers
SELECT updates.customerId as mergeKey, updates.* FROM updates
UNION ALL
-- These rows will INSERT new addresses of existing customers
-- Setting the mergeKey to NULL forces these rows
-- to NOT MATCH and be INSERTed.
SELECT NULL as mergeKey, updates.*
FROM updates JOIN customers
ON updates.customerId = customers.customerId
WHERE customers.current = true
AND updates.address <> customers.address
This resulting dataframe is then used as the source of a MERGE statement that is called from .foreachBatch.
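As a rough sketch of that wiring (assuming a streaming DataFrame updates_stream, a Delta target table named customers with columns customerId, address, current, effective_date and end_date, and the build_staged_updates helper from the sketch above - the table name, checkpoint path and column names are assumptions for illustration):

def upsert_scd2(micro_batch_df, batch_id):
    # Use the micro-batch's own session so the temp view registered below is the
    # one the MERGE reads.
    spark = micro_batch_df.sparkSession
    customers_df = spark.table("customers")

    staged_updates = build_staged_updates(micro_batch_df, customers_df)
    staged_updates.createOrReplaceTempView("staged_updates")

    # SCD Type 2 MERGE in the same shape as the documentation example:
    # matched current rows are closed, rows with a NULL mergeKey fall through
    # to the INSERT branch as the new current version.
    spark.sql("""
        MERGE INTO customers
        USING staged_updates
        ON customers.customerId = staged_updates.mergeKey
        WHEN MATCHED AND customers.current = true
                     AND customers.address <> staged_updates.address THEN
          UPDATE SET current = false, end_date = staged_updates.effective_date
        WHEN NOT MATCHED THEN
          INSERT (customerId, address, current, effective_date, end_date)
          VALUES (staged_updates.customerId, staged_updates.address,
                  true, staged_updates.effective_date, null)
    """)

(updates_stream.writeStream
    .foreachBatch(upsert_scd2)
    .option("checkpointLocation", "/tmp/checkpoints/customers_scd2")
    .outputMode("update")
    .start())

The MERGE runs once per micro-batch, so each batch of changes is folded into the dimension table transactionally.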