Join PySpark dataframes based on the most recent record


I need to join these dataframes:

df0:
+---+--------+
|id |quantity|
+---+--------+
|  a|       4|
|  b|       7|
|  c|       6|
|  d|       1|
+---+--------+
df1:
+---+--------+----------+
|id |order_id|order_date|
+---+--------+----------+
|  a|       x|2021-01-25|
|  a|       y|2021-01-23|
|  b|       z|2021-01-28|
|  b|       x|2021-01-20|
|  c|       y|2021-01-15|
|  d|       x|2021-01-18|
+---+--------+----------+

and the result I want to get is the following:

+---+--------+--------+----------+
|id |quantity|order_id|order_date|
+---+--------+--------+----------+
|  a|       4|       x|2021-01-25|
|  b|       7|       z|2021-01-28|
|  c|       6|       y|2021-01-15|
|  d|       1|       x|2021-01-18|
+---+--------+--------+----------+

That is, each id in df0 should be joined only with its most recent record in df1, based on order_date.
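
For reference, the sample dataframes shown above could be built like this (a minimal sketch, assuming an existing SparkSession named spark; order_date is kept as an ISO-formatted string, which sorts chronologically):

# Hypothetical reconstruction of the sample data, assuming a SparkSession called `spark`
df0 = spark.createDataFrame(
    [("a", 4), ("b", 7), ("c", 6), ("d", 1)],
    ["id", "quantity"],
)

df1 = spark.createDataFrame(
    [
        ("a", "x", "2021-01-25"),
        ("a", "y", "2021-01-23"),
        ("b", "z", "2021-01-28"),
        ("b", "x", "2021-01-20"),
        ("c", "y", "2021-01-15"),
        ("d", "x", "2021-01-18"),
    ],
    ["id", "order_id", "order_date"],
)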

CodePudding user response:

Simply group df1 on id, aggregate the maximum order_date, and then join the result with df0:

import pyspark.sql.functions as F

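# For each id, keep only the most recent order_date, then attach it to df0 with an inner join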
result = df0.join(
    df1.groupBy("id").agg(F.max("order_date").alias("order_date")),
    on=["id"]
)

result.show()
#+---+--------+----------+
#| id|quantity|order_date|
#+---+--------+----------+
#|  d|       1|2021-01-18|
#|  c|       6|2021-01-15|
#|  b|       7|2021-01-28|
#|  a|       4|2021-01-25|
#+---+--------+----------+
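
Note that this keeps only the latest order_date per id, not the matching order_id from the desired output. One way to carry order_id along is a window function that ranks each id's orders by order_date and keeps the newest row; a minimal sketch, assuming ties on order_date can be broken arbitrarily:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Rank each id's orders from newest to oldest
w = Window.partitionBy("id").orderBy(F.col("order_date").desc())

latest = (
    df1.withColumn("rn", F.row_number().over(w))
       .filter(F.col("rn") == 1)
       .drop("rn")
)

# Attach quantity from df0; the result now includes order_id as well
result = df0.join(latest, on="id")
result.show()

An equivalent alternative is to join the aggregated maximum dates back to df1 on both id and order_date before joining with df0.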