I have the following delta table
+---+----+
|  A|   B|
+---+----+
|  1|  10|
|  1|null|
|  2|  20|
|  2|null|
+---+----+
I want to fill the null values in column B based on the A column.
I came up with the following to do so:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

var df = spark.sql("select * from MyDeltaTable")
val w = Window.partitionBy("A")
df = df.withColumn("B", last("B", ignoreNulls = true).over(w))
Which gives me the desired output:
+---+----+
|  A|   B|
+---+----+
|  1|  10|
|  1|  10|
|  2|  20|
|  2|  20|
+---+----+
Now, my question is:
What is the best way to write the result back to my Delta table correctly?
Should I MERGE? Rewrite with the overwrite option?
My Delta table is huge and it will keep growing; I am looking for the best possible method to achieve this.
Thank you
CodePudding user response:
It depends on the distribution of the rows that contain the null values you'd like to fill (i.e., are they all in one file or spread across many?).
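One rough way to check this, as a sketch (assuming the table name from your question), is to count the distinct files that contain null rows:

import org.apache.spark.sql.functions.input_file_name

// How many underlying files hold rows with a null B?
spark.table("MyDeltaTable")
  .where("B IS NULL")
  .select(input_file_name().as("file"))
  .distinct()
  .count()

If that count is small relative to the table's total file count, a targeted MERGE touches less data; if nulls are spread everywhere, an overwrite may be simpler.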
MERGE will rewrite entire files, so you may end up rewriting enough of the table to justify simply overwriting it instead. You'll have to test this to determine what's best for your use case.
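If you do go with an overwrite, a minimal sketch might look like this (again assuming the table name from your question; Delta's snapshot isolation allows reading from and overwriting the same table in one job):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val w = Window.partitionBy("A")
val filled = spark.table("MyDeltaTable")
  .withColumn("B", last("B", ignoreNulls = true).over(w))

// Atomically replace the table contents with the filled result
filled.write.format("delta").mode("overwrite").saveAsTable("MyDeltaTable")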
Also, to use MERGE, you need to filter the dataset down to only the changes. Your example "desired output" table has all the data, which would fail to MERGE in its current state because there are duplicate keys.
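For the MERGE route, one sketch (assuming your column names, the Delta Lake Scala API, and one fill value per key) is to build a deduplicated source of fill values and match only the null target rows:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.first

// One fill value per key; first() picks an arbitrary non-null B if
// a key happens to have several distinct values.
val fills = spark.table("MyDeltaTable")
  .where("B IS NOT NULL")
  .groupBy("A")
  .agg(first("B").as("fillB"))

// Update only the rows whose B is null, leaving the rest untouched
DeltaTable.forName("MyDeltaTable").as("t")
  .merge(fills.as("s"), "t.A = s.A AND t.B IS NULL")
  .whenMatched()
  .updateExpr(Map("B" -> "s.fillB"))
  .execute()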
Check the Important! section in the docs for more details.