Let's say that I have these two dataframes
df1 : | Name| Surname | email
John Smith [email protected]
Jake Smith [email protected]
Anna Hendrix [email protected]
Kale Kinderstone [email protected]
George Hiddleston [email protected]
Patrick Huston [email protected]
df2 : | Name| Surname | email
John Smith [email protected]
Hannah Montana [email protected]
Anna Hendrix [email protected]
Kale Kinderstone [email protected]
Ivan Gaganovitch [email protected]
Florence Jekins [email protected]
What I want to do is replacing some particular emails without touching the rest of the data. So, the final product I want to make is
df3 : | Name| Surname | email
John Smith [email protected]
Jake Smith [email protected]
Anna Hendrix [email protected]
Kale Kinderstone [email protected]
George Hiddleston [email protected]
Patrick Huston [email protected]
At the beginning, I tried joining them by concatenating the names and using the column as key, but then I got stuck on how to process the column and then how to remove the df2 data.
CodePudding user response:
Join the dataframes, but use alias
on them. Then you will be able to choose between columns of the same name.
df3 = (df1.alias('a')
.join(df2.alias('b'),['Name', 'Surname'], 'left')
.select(
'Name',
'Surname',
F.coalesce('b.email', 'a.email').alias('email')
)
)
df3.show()
# ------- ----------- --------------------
# | Name| Surname| email|
# ------- ----------- --------------------
# | Anna| Hendrix| [email protected]|
# | Jake| Smith| [email protected]|
# | John| Smith| [email protected]|
# |Patrick| Huston|Huston1990@yahoom...|
# | George| Hiddleston| [email protected]|
# | Kale|Kinderstone|KKinderstone@ymai...|
# ------- ----------- --------------------