Left Join with conditions and aggregate MAX using Spark Python / PySpark


What I have: two massive Spark DataFrames; here are some samples.

  • Dataframe A:

    ID   IG  OpenDate
    P111 100 13/04/2022
    P222 101 16/04/2022
    P333 102 20/04/2022

  • Dataframe B:

    IG  Service Dt_Service
    100 A       12/04/2022
    100 B       13/04/2022
    100 B       14/04/2022
    101 A       15/04/2022
    101 A       16/04/2022
    101 B       17/04/2022
    101 B       18/04/2022
    102 A       19/04/2022
    102 B       20/04/2022

What I want: a left join that adds the columns 'Service' and 'Dt_Service' to Dataframe A on the key 'IG', keeping only the most recent 'Service' (by 'Dt_Service') and its corresponding date for each row of Dataframe A. This is the result I expect:

ID   IG  OpenDate   Service Dt_Service
P111 100 13/04/2022 B       14/04/2022
P222 101 16/04/2022 B       18/04/2022
P333 102 20/04/2022 B       20/04/2022

Tool: Spark 2.2 with PySpark, since I am working on Hadoop.

Thank you for your help

CodePudding user response:

As samkart said, we can use rank/row_number to get the latest service first, then join to get the desired result:

from pyspark.sql import functions as F
from pyspark.sql import Window

se = "IG string,Service string,Dt_Service string"
de = [("100","A","2022-04-12"),("100","B","2022-04-13"),("100","B","2022-04-14"),("101","A","2022-04-15"),("101","A","2022-04-16"),("101","B","2022-04-17"),("101","B","2022-04-18"),("102","A","2022-04-19"),("102","B","2022-04-20")]

df1 = spark.createDataFrame([("P111","100","13/04/2022"),("P222","101","16/04/2022"),("P333","102","20/04/2022")], "ID string,IG string,OpenDate string")
df2 = spark.createDataFrame(de, se)

# keep, per IG, only the row with the most recent Dt_Service
latest = df2.withColumn("rn", F.row_number().over(
        Window.partitionBy("IG").orderBy(F.to_date(F.col("Dt_Service")).desc())
    )).filter("rn == 1").drop("rn")

df1.join(latest, "IG", "left").show()

#output
+---+----+----------+-------+----------+
| IG|  ID|  OpenDate|Service|Dt_Service|
+---+----+----------+-------+----------+
|100|P111|13/04/2022|      B|2022-04-14|
|101|P222|16/04/2022|      B|2022-04-18|
|102|P333|20/04/2022|      B|2022-04-20|
+---+----+----------+-------+----------+