I have a question about the best practice to insert incremental data to a partitioned table. Currently I have a historical table partitioned by day, I have another sql that produce the same table with two latest days' data. Here is a sample historical table and incremental table -
tbl1: historical
day | sales |
---|---|
2021-01-01 | 10 |
2021-01-02 | 15 |
2021-01-03 | 12 |
2021-01-04 | 10 |
tbl2: Incremental table (as of 1/6/2021)
day | sales |
---|---|
2021-01-04 | 25 |
2021-01-05 | 20 |
tbl1: Expected Output:
day | sales |
---|---|
2021-01-01 | 10 |
2021-01-02 | 15 |
2021-01-03 | 12 |
2021-01-04 | 25 |
2021-01-05 | 20 |
I tried insert into
and insert overwrite
, but it seems the insert into
will not replace the existing 1/4 data but will add additional line and create duplicates, while the insert overwrite
will remove all data from 1/1 to 1/3.
Sample code I wrote:
INSERT INTO tbl1 PARTITION (day) SELECT * FROM tbl2
CodePudding user response:
INSERT OVERWRITE rewrites a whole table or partition, but it works will many file formats. So to upsert using INSERT OVERWRITE you must first LEFT JOIN the new data with the existing data, and use that to replace the partition.
If you're using Delta you can MERGE instead.
CodePudding user response:
Not sure, what is your table format . but if your table is Parquet
or delta
format , handling this scenario is really simple . You can refer link