Home > Mobile >  How to process a dataframe column based on another dataframe column in PySpark?
How to process a dataframe column based on another dataframe column in PySpark?

Time:08-02

Let's say that I have these two dataframes

df1 :   | Name|    Surname      | email   
          John      Smith         [email protected]
          Jake      Smith         [email protected] 
          Anna      Hendrix       [email protected]      
          Kale      Kinderstone   [email protected]
         George     Hiddleston    [email protected]
        Patrick     Huston        [email protected]


df2 :   | Name|    Surname      | email   
          John      Smith         [email protected]
         Hannah     Montana       [email protected] 
          Anna      Hendrix       [email protected]      
          Kale      Kinderstone   [email protected]
         Ivan       Gaganovitch   [email protected]
        Florence     Jekins       [email protected]

What I want to do is replacing some particular emails without touching the rest of the data. So, the final product I want to make is

df3 :   | Name|    Surname      | email   
          John      Smith         [email protected]
          Jake      Smith         [email protected] 
          Anna      Hendrix       [email protected]      
          Kale      Kinderstone   [email protected]
         George     Hiddleston    [email protected]
        Patrick     Huston        [email protected]

At the beginning, I tried joining them by concatenating the names and using the column as key, but then I got stuck on how to process the column and then how to remove the df2 data.

CodePudding user response:

Join the dataframes, but use alias on them. Then you will be able to choose between columns of the same name.

df3 = (df1.alias('a')
    .join(df2.alias('b'),['Name', 'Surname'], 'left')
    .select(
        'Name',
        'Surname',
        F.coalesce('b.email', 'a.email').alias('email')
    )
)
df3.show()
#  ------- ----------- -------------------- 
# |   Name|    Surname|               email|
#  ------- ----------- -------------------- 
# |   Anna|    Hendrix|  [email protected]|
# |   Jake|      Smith| [email protected]|
# |   John|      Smith|    [email protected]|
# |Patrick|     Huston|Huston1990@yahoom...|
# | George| Hiddleston|      [email protected]|
# |   Kale|Kinderstone|KKinderstone@ymai...|
#  ------- ----------- -------------------- 
  • Related