I want to merge two dataframe rows with one column value different. For example,
DataframeA:
firstName lastName age
Alex Smith 19
Rick Mart 18
DataframeB:
firstName lastName age
Alex Smith 21
Result when I use merge DataframeA with DataframeB using union
:
firstName lastName age
Alex Smith 19
Rick Mart 18
Alex Smith 21
What I want is that the rows with all column values same but different age should get combined as well, in a way that the age column has the max value. So, this is what I expect -
firstName lastName age
Alex Smith 21
Rick Mart 18
Anyway I can achieve this in PySpark?
Thanks
CodePudding user response:
you could use a full join as well. you'll need to retain the greatest
of the 2 ages (from 2 dataframes).
here's the example - I've added an additional row in the second dataframe that is not available in the first dataframe
data1_sdf.alias('a'). \
join(data2_sdf.alias('b'), ['firstname', 'lastname'], 'full'). \
selectExpr('firstname', 'lastname', 'greatest(a.age, b.age) as age'). \
show()
# --------- -------- ---
# |firstname|lastname|age|
# --------- -------- ---
# | Rick| Smart| 20|
# | Alex| Smith| 21|
# | Rick| Mart| 18|
# --------- -------- ---
dataframes used
--------- -------- ---
|firstname|lastname|age|
--------- -------- ---
| Alex| Smith| 19|
| Rick| Mart| 18|
--------- -------- ---
--------- -------- ---
|firstname|lastname|age|
--------- -------- ---
| Alex| Smith| 21|
| Rick| Smart| 20|
--------- -------- ---
CodePudding user response:
After union
you can easily aggregate the resultant set get the max
of age
Data Preparation
s1 = StringIO("""
firstName,lastName,age
Alex,Smith,19
Rick,Mart,18
""")
df1 = pd.read_csv(s1,delimiter=',')
sparkDF1 = sql.createDataFrame(df1)
s2 = StringIO("""
firstName,lastName,age
Alex,Smith,21
""")
df2 = pd.read_csv(s2,delimiter=',')
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show()
--------- -------- ---
|firstName|lastName|age|
--------- -------- ---
| Alex| Smith| 19|
| Rick| Mart| 18|
--------- -------- ---
sparkDF2.show()
--------- -------- ---
|firstName|lastName|age|
--------- -------- ---
| Alex| Smith| 21|
--------- -------- ---
Union & Max
sparkDF1.union(sparkDF2)\
.groupBy(['firstName','lastName']).agg(F.max(F.col('age')).alias('age'))\
.show()
--------- -------- ---
|firstName|lastName|age|
--------- -------- ---
| Alex| Smith| 21|
| Rick| Mart| 18|
--------- -------- ---