Spark Delta table updates based on multiple rows - Pyspark - deltaLake

Time:07-20

I need to update a Delta table based on rows from a lookup file.

The lookup file has two columns, a_acc and b_acc, and will have multiple rows. I need to update the b_acc value in the Delta table wherever its a_acc column matches an a_acc value in the lookup table.

Lookup table
+-----+-----+
|a_acc|b_acc|
+-----+-----+
| 4636| 1999|
| 1023|  892|
| 3333| 1111|
+-----+-----+


Delta table
+-----+-----+
|a_acc|b_acc|
+-----+-----+
| 4636| 0123|
| 1023|  843|
| 3333| 3232|
+-----+-----+

Output Delta table:
+-----+-----+
|a_acc|b_acc|
+-----+-----+
| 4636| 1999|
| 1023|  892|
| 3333| 1111|
+-----+-----+

For a single row I can update the value in the Delta table:

deltaTable.update(condition = "a_acc = '4636'",set = {"b_acc": "1999"})

But how can I loop over all the values in the lookup table and update the Delta table accordingly?

CodePudding user response:

You can use merge to update the values (b_acc) in the Delta table whenever a matching key (a_acc) is found in the lookup table.

Below is sample code for the same:

deltaTable.alias("dt").merge(
    source = Lookuptable.alias("lt"),
    condition = "dt.a_acc = lt.a_acc"
  ).whenMatchedUpdate(set =
    {
      "b_acc": "lt.b_acc"
    }
  ).execute()
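To make the merge semantics concrete, here is a plain-Python simulation (hypothetical, using dicts in place of DataFrames, not the Delta Lake API): each target row whose a_acc has a match on the source side gets its b_acc overwritten, and everything else is left untouched.

```python
# Hypothetical plain-Python simulation of the Delta merge above.
# Target (Delta table) and source (lookup table) rows as dicts.
delta_rows = [
    {"a_acc": 4636, "b_acc": 123},
    {"a_acc": 1023, "b_acc": 843},
    {"a_acc": 3333, "b_acc": 3232},
]
lookup_rows = [
    {"a_acc": 4636, "b_acc": 1999},
    {"a_acc": 1023, "b_acc": 892},
    {"a_acc": 3333, "b_acc": 1111},
]

# Build a key -> b_acc map from the source side (the merge condition key).
lookup_map = {r["a_acc"]: r["b_acc"] for r in lookup_rows}

# whenMatchedUpdate: overwrite b_acc only where a_acc matches.
for row in delta_rows:
    if row["a_acc"] in lookup_map:
        row["b_acc"] = lookup_map[row["a_acc"]]
```

This is a single set-based operation in Spark, so no explicit loop over lookup rows is needed in the real merge.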

CodePudding user response:

Hope this can help you.

from pyspark.sql.functions import coalesce, col

d1 = [[4636, 1999], [1023, 892], [3333, 1111], [444, 123]]
lookup_table = spark.createDataFrame(d1, ['a_acc', 'b_acc'])
d2 = [[4636, 123], [1023, 843], [3333, 3232], [567, 221]]
delta_table = spark.createDataFrame(d2, ['a_acc', 'b_acc'])

(
    delta_table
        # left join keeps every delta row; rename to avoid a column-name clash
        .join(lookup_table.withColumnRenamed("b_acc", "lookup_b_acc"), ["a_acc"], "left")
        # take the lookup value when present, otherwise keep the original
        .withColumn("b_acc", coalesce(col("lookup_b_acc"), col("b_acc")))
        .sort("a_acc")
        .show()
)
+-----+-----+------------+
|a_acc|b_acc|lookup_b_acc|
+-----+-----+------------+
|  567|  221|        null|
| 1023|  892|         892|
| 3333| 1111|        1111|
| 4636| 1999|        1999|
+-----+-----+------------+
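The left-join-plus-coalesce logic above can also be sketched in plain Python (a hypothetical simulation with tuples instead of DataFrames): delta rows without a lookup match keep their original b_acc, and lookup_b_acc stays None for them, mirroring the null in the output above.

```python
# Hypothetical plain-Python simulation of the left join + coalesce.
delta_rows = [(4636, 123), (1023, 843), (3333, 3232), (567, 221)]
lookup_rows = [(4636, 1999), (1023, 892), (3333, 1111), (444, 123)]
lookup_map = dict(lookup_rows)  # a_acc -> lookup b_acc

result = []
for a_acc, b_acc in sorted(delta_rows):  # sort("a_acc")
    lookup_b_acc = lookup_map.get(a_acc)  # None when no match (left join)
    # coalesce(lookup_b_acc, b_acc): prefer the lookup value when present
    new_b_acc = lookup_b_acc if lookup_b_acc is not None else b_acc
    result.append((a_acc, new_b_acc, lookup_b_acc))
```

Note that before writing this result back to the Delta table you would drop the helper lookup_b_acc column, whereas the merge approach updates the table in place.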