What I have: two massive Spark DataFrames; here are some samples.
- Dataframe A:
ID | IG | OpenDate |
---|---|---|
P111 | 100 | 13/04/2022 |
P222 | 101 | 16/04/2022 |
P333 | 102 | 20/04/2022 |
- Dataframe B:
IG | Service | Dt_Service |
---|---|---|
100 | A | 12/04/2022 |
100 | B | 13/04/2022 |
100 | B | 14/04/2022 |
101 | A | 15/04/2022 |
101 | A | 16/04/2022 |
101 | B | 17/04/2022 |
101 | B | 18/04/2022 |
102 | A | 19/04/2022 |
102 | B | 20/04/2022 |
What I want: I want to left join the two columns 'Service' and 'Dt_Service' from DataFrame B onto DataFrame A using the key 'IG', but keeping only the most recent 'Service' (by 'Dt_Service') with its corresponding date for each row of DataFrame A. This is the result I expect:
ID | IG | OpenDate | Service | Dt_Service |
---|---|---|---|---|
P111 | 100 | 13/04/2022 | B | 14/04/2022 |
P222 | 101 | 16/04/2022 | B | 18/04/2022 |
P333 | 102 | 20/04/2022 | B | 20/04/2022 |
Tool: Spark 2.2 with PySpark, since I am working on Hadoop.
Thank you for your help.
CodePudding user response:
As samkart suggested, we can use row_number() to keep only the latest service per IG first, and then join to get the desired result:
from pyspark.sql import functions as F
from pyspark.sql import Window
df1 = spark.createDataFrame([("P111","100","13/04/2022"),("P222","101","16/04/2022"),("P333","102","20/04/2022")],"ID string,IG string,OpenDate string")
df2 = spark.createDataFrame([("100","A","2022-04-12"),("100","B","2022-04-13"),("100","B","2022-04-14"),("101","A","2022-04-15"),("101","A","2022-04-16"),("101","B","2022-04-17"),("101","B","2022-04-18"),("102","A","2022-04-19"),("102","B","2022-04-20")],"IG string,Service string,Dt_Service string")
# keep only the most recent service row per IG
w = Window.partitionBy("IG").orderBy(F.to_date(F.col("Dt_Service")).desc())
df2_latest = df2.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
# left join as requested, so any ID without services is still kept
df1.join(df2_latest, "IG", "left").show()
# output
+---+----+----------+-------+----------+
| IG|  ID|  OpenDate|Service|Dt_Service|
+---+----+----------+-------+----------+
|100|P111|13/04/2022|      B|2022-04-14|
|101|P222|16/04/2022|      B|2022-04-18|
|102|P333|20/04/2022|      B|2022-04-20|
+---+----+----------+-------+----------+