I have the following delta table
+---+----+
|  A|   B|
+---+----+
|  1|  10|
|  1|null|
|  2|  20|
|  2|null|
+---+----+
I want to fill the null values in column B based on the A column.
I came up with the following to do so:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

var df = spark.sql("select * from MyDeltaTable")
val w = Window.partitionBy("A")
df = df.withColumn("B", last("B", ignoreNulls = true).over(w))
Which gives me the desired output:
+---+----+
|  A|   B|
+---+----+
|  1|  10|
|  1|  10|
|  2|  20|
|  2|  20|
+---+----+
Now, my question is:
What is the best way to write the result back to my Delta table correctly?
Should I MERGE? Rewrite with the overwrite option?
My Delta table is huge and it will keep growing; I am looking for the best possible method to achieve this.
Thank you
CodePudding user response:
It depends on the distribution of the rows that contain the null values you'd like to fill (i.e., are they all in one file or spread across many?).
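One rough way to check this, as a sketch (assuming the table name from your question), is to count the distinct files that contain null rows:

import org.apache.spark.sql.functions.input_file_name

// How many underlying files hold rows with a null B?
spark.table("MyDeltaTable")
  .where("B IS NULL")
  .select(input_file_name().as("file"))
  .distinct()
  .count()

If that count is small relative to the table's total file count, a targeted MERGE touches less data; if nulls are spread everywhere, an overwrite may be simpler.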
MERGE will rewrite entire files, so you may end up rewriting enough of the table to justify simply overwriting it instead. You'll have to test this to determine what's best for your use case.
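If you do go with an overwrite, a minimal sketch might look like this (again assuming the table name from your question; Delta's snapshot isolation allows reading from and overwriting the same table in one job):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val w = Window.partitionBy("A")
val filled = spark.table("MyDeltaTable")
  .withColumn("B", last("B", ignoreNulls = true).over(w))

// Atomically replace the table contents with the filled result
filled.write.format("delta").mode("overwrite").saveAsTable("MyDeltaTable")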
Also, to use MERGE, you need to filter the dataset down to only the changes. Your example "desired output" table has all the data, which would fail to MERGE in its current state because there are duplicate keys.
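For the MERGE route, one sketch (assuming your column names, the Delta Lake Scala API, and one fill value per key) is to build a deduplicated source of fill values and match only the null target rows:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.first

// One fill value per key; first() picks an arbitrary non-null B if
// a key happens to have several distinct values.
val fills = spark.table("MyDeltaTable")
  .where("B IS NOT NULL")
  .groupBy("A")
  .agg(first("B").as("fillB"))

// Update only the rows whose B is null, leaving the rest untouched
DeltaTable.forName("MyDeltaTable").as("t")
  .merge(fills.as("s"), "t.A = s.A AND t.B IS NULL")
  .whenMatched()
  .updateExpr(Map("B" -> "s.fillB"))
  .execute()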
Check the Important! section in the docs for more details.