I am using a very simple architecture to copy data from an external source into Azure Data Lake Storage Gen2 and serve it to Power BI via a serverless SQL pool (where I perform some aggregations).
For the initial load, I used a Copy Data activity (Synapse pipeline) and stored the data in Parquet files. Since Parquet files in ADLS Gen2 do not support UPDATE operations, I am looking for best practices to build the incremental load (watermarking process) without using an additional database to hold the control/watermark table and run the stored procedure that updates the last run date.
Has anyone bumped into this before? Thanks!
PS: I first checked the best practices here: https://docs.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-overview
CodePudding user response:
Synapse Serverless SQL pools can write data to ADLS using the CETAS command (CREATE EXTERNAL TABLE AS SELECT). If you land the incremental inserts, updates, and deletes from the source in a Parquet file, you can use a CETAS statement to full-outer-join the existing table with the incremental changes and create a new table in a new folder. It will require a little orchestration with a Synapse pipeline to delete the old folder and rename the new one at the end of the process. We have used this exact approach with Synapse pipelines and Synapse Serverless SQL CETAS before, and it works well for moderately sized data.
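For illustration, a minimal sketch of what that CETAS statement could look like. The external data source, file format, folder paths, key column, and the IsDeleted soft-delete flag are all assumptions and would need to match your own environment:

```sql
-- Sketch only: MyAdlsDataSource, MyParquetFormat, the folder paths, CustomerId
-- and IsDeleted are hypothetical names that must match your setup.
CREATE EXTERNAL TABLE dbo.Customer_new
WITH (
    LOCATION    = 'curated/customer_new/',   -- new folder; the pipeline swaps it in afterwards
    DATA_SOURCE = MyAdlsDataSource,          -- existing external data source pointing at ADLS Gen2
    FILE_FORMAT = MyParquetFormat            -- existing Parquet external file format
)
AS
SELECT
    COALESCE(i.CustomerId, o.CustomerId)                               AS CustomerId,
    CASE WHEN i.CustomerId IS NOT NULL THEN i.Name   ELSE o.Name   END AS Name,
    CASE WHEN i.CustomerId IS NOT NULL THEN i.Amount ELSE o.Amount END AS Amount
FROM OPENROWSET(
        BULK 'curated/customer/*.parquet',
        DATA_SOURCE = 'MyAdlsDataSource',
        FORMAT = 'PARQUET') AS o                     -- current full table
FULL OUTER JOIN OPENROWSET(
        BULK 'staging/customer_increment/*.parquet',
        DATA_SOURCE = 'MyAdlsDataSource',
        FORMAT = 'PARQUET') AS i                     -- incremental inserts/updates/deletes
    ON o.CustomerId = i.CustomerId
WHERE COALESCE(i.IsDeleted, 0) = 0;                  -- drop rows the increment marks as deleted
```

The CASE expressions prefer the incremental row whenever the key matched, so updates win over the old values; a pipeline step would then retire the old folder and promote the new one before the next Power BI refresh.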
I wouldn't use it for terabyte-sized data, because you are completely rewriting the Parquet files for the table every day. In that situation I would use Databricks or Synapse Spark compute to maintain Delta Lake tables as the final storage in ADLS, since Delta Lake supports updates. Synapse Serverless SQL can still query Delta Lake tables.
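For comparison, the Delta Lake route replaces the full rewrite with a MERGE. A minimal Spark SQL sketch (Synapse Spark or Databricks), again with hypothetical table, key, and soft-delete column names:

```sql
-- Sketch only: curated.customer and staging.customer_increment are assumed to be
-- Delta tables; CustomerId and IsDeleted are hypothetical columns.
MERGE INTO curated.customer AS t
USING staging.customer_increment AS s
    ON t.CustomerId = s.CustomerId
WHEN MATCHED AND s.IsDeleted = 1 THEN DELETE          -- apply deletes
WHEN MATCHED THEN UPDATE SET *                        -- apply updates
WHEN NOT MATCHED AND s.IsDeleted = 0 THEN INSERT *;   -- apply inserts
```

Only the affected files are rewritten under the hood, so this scales better than regenerating the whole Parquet folder, and the serverless SQL pool can still read the resulting Delta table for the Power BI aggregations.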