DF1:
accountID | cost |
---|---|
act1 | 100 |
act2 | 150 |
DF2:
accountID | serviceType | region | date | cost | output |
---|---|---|---|---|---|
act1 | s1 | us-east-1 | 2022-07-01 | 20 | |
act1 | s1 | us-east-1 | 2022-07-02 | 50 | |
act1 | s1 | us-east-1 | 2022-07-30 | 100 | |
act2 | s1 | us-east-1 | 2022-07-01 | 100 | |
act2 | s1 | us-east-1 | 2022-07-30 | 150 |
excepted output: when accountID is matched between dataframes, output column should be value filled from DF1
accountID | serviceType | region | date | cost | output |
---|---|---|---|---|---|
act1 | s1 | us-east-1 | 2022-07-01 | 20 | 100 |
act1 | s1 | us-east-1 | 2022-07-02 | 50 | 100 |
act1 | s1 | us-east-1 | 2022-07-30 | 100 | 100 |
act2 | s1 | us-east-1 | 2022-07-01 | 100 | 150 |
act2 | s1 | us-east-1 | 2022-07-30 | 150 | 150 |
I tried with withColumn and joins but it is not working.
val outputDF = df2.withColumn("output",when(col("accountID") === partitionedDf("accountID"),partitionedDf("cost")))
related post in python however I want it in scala. assign values of one dataframe column to another dataframe column based on condition
CodePudding user response:
A join
is what you are looking for.
Its purpose is exactly to merge related data from different tables/datasets.
df1.join(df2.withColumnRenamed("cost", "output"), "accountID")
Since in both df1
and df2
you have a column cost
and you want the df2.cost
to be renamed into output
, it is convenient (but not required) to do it before the join (df2.withColumnRenamed("cost", "output")
).
In this case a left join is used (the default, since not specified) and the join condition is the equality of the accountID
column.