I need to join these dataframes:
df0:
--------------
|id |quantity|
--------------
|  a|       4|
|  b|       7|
|  c|       6|
|  d|       1|
--------------
df1:
-------------------------
|id |order_id|order_date|
-------------------------
|  a|       x|2021-01-25|
|  a|       y|2021-01-23|
|  b|       z|2021-01-28|
|  b|       x|2021-01-20|
|  c|       y|2021-01-15|
|  d|       x|2021-01-18|
-------------------------
and the result I want to get is the following:
----------------------------------
|id |quantity|order_id|order_date|
----------------------------------
|  a|       4|       x|2021-01-25|
|  b|       7|       z|2021-01-28|
|  c|       6|       y|2021-01-15|
|  d|       1|       x|2021-01-18|
----------------------------------
That is, I need to join each id only with its most recent record based on order_date.
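For reproducibility, the frames can be created like this (a minimal sketch; order_date is kept as a string, which still sorts correctly in ISO format):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the tables above
df0 = spark.createDataFrame(
    [("a", 4), ("b", 7), ("c", 6), ("d", 1)],
    ["id", "quantity"],
)
df1 = spark.createDataFrame(
    [
        ("a", "x", "2021-01-25"),
        ("a", "y", "2021-01-23"),
        ("b", "z", "2021-01-28"),
        ("b", "x", "2021-01-20"),
        ("c", "y", "2021-01-15"),
        ("d", "x", "2021-01-18"),
    ],
    ["id", "order_id", "order_date"],
)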
CodePudding user response:
Simply group df1 by id and aggregate the max order_date, then join the result with df0:
import pyspark.sql.functions as F

# Keep only the latest order_date per id, then join back to df0
result = df0.join(
    df1.groupBy("id").agg(F.max("order_date").alias("order_date")),
    on=["id"]
)
result.show()
#+---+--------+----------+
#| id|quantity|order_date|
#+---+--------+----------+
#|  d|       1|2021-01-18|
#|  c|       6|2021-01-15|
#|  b|       7|2021-01-28|
#|  a|       4|2021-01-25|
#+---+--------+----------+
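Note that this carries over only the aggregated order_date, so the result above does not contain order_id. If order_id is also needed, as in the desired output, one option is to pick the latest df1 row per id with a window function and then join; a minimal sketch, assuming df0 and df1 are defined as in the question:

import pyspark.sql.functions as F
from pyspark.sql import Window

# Rank df1 rows within each id by order_date, newest first, and keep only the top row
w = Window.partitionBy("id").orderBy(F.col("order_date").desc())
latest_orders = (
    df1.withColumn("rn", F.row_number().over(w))
       .filter(F.col("rn") == 1)
       .drop("rn")
)

# Attach quantity from df0; columns come out as id, quantity, order_id, order_date
result = df0.join(latest_orders, on=["id"])
result.show()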