Merge pyspark data frame rows with one column value different-CodePudding

I want to merge two dataframe rows with one column value different. For example,

DataframeA:

firstName lastName age 
Alex      Smith    19
Rick      Mart     18

DataframeB:

firstName lastName age 
Alex      Smith    21

Result when I use merge DataframeA with DataframeB using union:

firstName lastName age 
Alex      Smith    19
Rick      Mart     18
Alex      Smith    21

What I want is that the rows with all column values same but different age should get combined as well, in a way that the age column has the max value. So, this is what I expect -

firstName lastName age 
Alex      Smith    21
Rick      Mart     18

Anyway I can achieve this in PySpark?

Thanks

CodePudding user response：

you could use a full join as well. you'll need to retain the greatest of the 2 ages (from 2 dataframes).

here's the example - I've added an additional row in the second dataframe that is not available in the first dataframe

data1_sdf.alias('a'). \
    join(data2_sdf.alias('b'), ['firstname', 'lastname'], 'full'). \
    selectExpr('firstname', 'lastname', 'greatest(a.age, b.age) as age'). \
    show()

#  --------- -------- --- 
# |firstname|lastname|age|
#  --------- -------- --- 
# |     Rick|   Smart| 20|
# |     Alex|   Smith| 21|
# |     Rick|    Mart| 18|
#  --------- -------- ---

dataframes used

 --------- -------- --- 
|firstname|lastname|age|
 --------- -------- --- 
|     Alex|   Smith| 19|
|     Rick|    Mart| 18|
 --------- -------- --- 

 --------- -------- --- 
|firstname|lastname|age|
 --------- -------- --- 
|     Alex|   Smith| 21|
|     Rick|   Smart| 20|
 --------- -------- ---

CodePudding user response：

After union you can easily aggregate the resultant set get the max of age

Data Preparation

s1 = StringIO("""
firstName,lastName,age
Alex,Smith,19
Rick,Mart,18
""")


df1 = pd.read_csv(s1,delimiter=',')

sparkDF1 = sql.createDataFrame(df1)


s2 = StringIO("""
firstName,lastName,age
Alex,Smith,21
""")


df2 = pd.read_csv(s2,delimiter=',')

sparkDF2 = sql.createDataFrame(df2)

sparkDF1.show()

 --------- -------- --- 
|firstName|lastName|age|
 --------- -------- --- 
|     Alex|   Smith| 19|
|     Rick|    Mart| 18|
 --------- -------- --- 

sparkDF2.show()

 --------- -------- --- 
|firstName|lastName|age|
 --------- -------- --- 
|     Alex|   Smith| 21|
 --------- -------- ---

Union & Max

sparkDF1.union(sparkDF2)\
        .groupBy(['firstName','lastName']).agg(F.max(F.col('age')).alias('age'))\
        .show()

 --------- -------- --- 
|firstName|lastName|age|
 --------- -------- --- 
|     Alex|   Smith| 21|
|     Rick|    Mart| 18|
 --------- -------- ---