In the example below, I would like to override the values in Spark DataFrame A with the corresponding values from DataFrame B (if they exist). Is there a way to do this with Spark (Scala)?
DataFrame A
ID Name Age
1 Paul 30
2 Sean 35
3 Rob 25
DataFrame B
ID Name Age
1 Paul 40
Result
ID Name Age
1 Paul 40
2 Sean 35
3 Rob 25
CodePudding user response:
The combined use of a left join and coalesce should do the trick, something like:
import org.apache.spark.sql.functions.coalesce

dfA
  .join(dfB, Seq("ID"), "left")
  .select(
    dfA.col("ID"),
    dfA.col("Name"),
    coalesce(dfB.col("Age"), dfA.col("Age")).as("Age")
  )
Explanation: for a given ID some_id, there are two cases:
- if dfB does not contain some_id, the left join produces null for dfB.col("Age"), and coalesce returns the first non-null value among the expressions passed to it, i.e. the value of dfA.col("Age");
- if dfB does contain some_id, the value from dfB.col("Age") is used.
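For reference, a minimal self-contained sketch of the whole flow (a sketch, not production code: it assumes a local SparkSession and rebuilds the two tables from the question; the object and app names are arbitrary):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.coalesce

object OverrideAgeExample {
  def main(args: Array[String]): Unit = {
    // Local session just for the demo; in a real job the session comes from your runtime.
    val spark = SparkSession.builder().master("local[*]").appName("override-example").getOrCreate()
    import spark.implicits._

    val dfA = Seq((1, "Paul", 30), (2, "Sean", 35), (3, "Rob", 25)).toDF("ID", "Name", "Age")
    val dfB = Seq((1, "Paul", 40)).toDF("ID", "Name", "Age")

    val result = dfA
      .join(dfB, Seq("ID"), "left")          // keep every row of dfA
      .select(
        dfA.col("ID"),
        dfA.col("Name"),
        // dfB.col("Age") is null where dfB has no matching ID, so dfA's value wins there
        coalesce(dfB.col("Age"), dfA.col("Age")).as("Age")
      )

    result.orderBy("ID").show()
    // ID 1 takes Age 40 from dfB; IDs 2 and 3 keep their dfA ages.

    spark.stop()
  }
}
```

Note that joining on Seq("ID") (rather than an expression like dfA("ID") === dfB("ID")) keeps a single ID column in the output, which avoids the ambiguous-column error when selecting it afterwards.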