Home > database >  Overriding values in dataframe while joining 2 dataframes
Overriding values in dataframe while joining 2 dataframes

Time:04-27

In the below example I would like to override the values in Spark Dataframe A with corresponding value in Dataframe B (if it exists). Is there a way to do it using Spark (Scala)?

Dataframe A

ID Name Age
1  Paul 30
2  Sean 35
3  Rob  25

Dataframe B

ID Name Age
1  Paul 40

Result

ID Name Age
1  Paul 40
2  Sean 35
3  Rob  25

CodePudding user response:

The combined use of a left join and coalesce should do the trick, something like:

dfA
  .join(dfB, "ID", "left")
  .select(
    dfA.col("ID"),
    dfA.col("Name"),
    coalesce(dfB.col("Age"), dfA.col("Age")).as("Age")
  )

Explanation: For a specific ID some_id, there is 2 cases:

  1. if dfB does not contain some_id: then the left join will produce null for dfB.col("Age") and the coalesce will return the first non-null value from expressions we passed to it, i.e. the value of dfA.col("Age")
  2. if dfB contains some_id then the value from dfB.col("Age") will be used.
  • Related