How to create an Azure data warehouse using Parquet files incrementally


I am working on a project that involves incremental data loading. I need to implement an Azure data warehouse with the following specifications:

Example situation: I have two Parquet files with the same structure; one of them is in the data lake, and the other has already been loaded into a table in a dedicated SQL pool.

What steps should I go through to end up with a table that merges the two files (updating existing rows matched on a specific id, and inserting new rows when no match is found)?

I would prefer not to use external tables, as they are slower.

CodePudding user response:

Does the target table have to be the same table into which the second Parquet file has already been loaded?

In any case, you could define a simple Synapse pipeline in which you read both the Parquet file and the table from the dedicated SQL pool, merge the two data flows, and then sink the result to the target SQL table by means of an upsert.

References:

  • https://learn.microsoft.com/en-us/azure/data-factory/data-flow-alter-row#merges-and-upserts-with-azure-sql-database-and-azure-synapse
  • https://www.taygan.co/blog/2018/04/20/upsert-to-azure-sql-db-with-azure-data-factory
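
If a mapping data flow is not an option, roughly the same merge-and-sink logic can be run from a Synapse Spark notebook instead. The sketch below is only illustrative: the storage path, database, schema, and table names are placeholders, and it assumes the Azure Synapse Dedicated SQL Pool connector (spark.read.synapsesql / DataFrame.write.synapsesql) that ships with Synapse Spark pools.

```python
# Sketch only: merge an incremental Parquet file with the table already
# loaded in the dedicated SQL pool, then write the result back.
# The `spark` session is pre-created in a Synapse notebook; all names
# and paths below are placeholders.

incremental = spark.read.parquet(
    "abfss://lake@mystorageaccount.dfs.core.windows.net/landing/new_rows.parquet"
)

# Read the table that already exists in the dedicated SQL pool
# (requires the Synapse Dedicated SQL Pool connector for Apache Spark).
existing = spark.read.synapsesql("mydwh.dbo.target_table")

# Emulate the upsert keyed on `id`: keep only the existing rows that have
# no incoming counterpart, then append every incoming row.
merged = (
    existing.join(incremental, on="id", how="left_anti")
    .unionByName(incremental)
)

# The connector does bulk loads rather than row-level updates, so the
# "update" is achieved by rewriting the whole target table.
merged.write.mode("overwrite").synapsesql("mydwh.dbo.target_table")
```

Whichever way you run it, the key point is the same as with the alter-row upsert: rows matched on id take the incoming values, and unmatched rows are inserted.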

CodePudding user response:

First of all, you cannot update a Parquet file without overwriting the existing data, which means you cannot update rows or insert new records directly. There is, however, a way around this.

You need to create layers of Parquet files:

  1. Parquet files that contain your historical (old) records.
  2. Parquet files that contain your incremental records.
  3. Parquet files created by merging the two layers above.

Once you have these Parquet files, you overwrite the historical file with the merged file, and on every iteration you refresh the incremental file.

To merge these Parquet files you can either use the built-in copy activity in Azure Synapse, or convert the Parquet files into data frames, merge those data frames, and write new Parquet files from the merged data frame, as sketched below.
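
As a rough sketch of the data-frame approach (all paths are placeholders; it assumes a Spark environment that can reach the lake, such as a Synapse Spark pool):

```python
# Layered-Parquet merge sketch; paths below are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-merge").getOrCreate()

base = "abfss://lake@mystorageaccount.dfs.core.windows.net/warehouse"

historical = spark.read.parquet(f"{base}/historical/")    # layer 1: old records
incremental = spark.read.parquet(f"{base}/incremental/")  # layer 2: new/changed records

# Rows whose id appears in the incremental layer take the incremental
# version; all other historical rows are kept, and brand-new ids are inserted.
merged = (
    historical.join(incremental, on="id", how="left_anti")
    .unionByName(incremental)
)

# Layer 3: write the merged result to its own folder first, then swap it in
# as the new historical layer. Parquet cannot be updated in place, and
# reading and overwriting the same path in a single job is unsafe.
merged.write.mode("overwrite").parquet(f"{base}/merged/")
```

After the merged layer replaces the historical one, the incremental folder can be cleared and reused for the next load.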
