I have 2 PySpark dataframes and want to join on "ID", then on a year from "date1" and "date2" columns and then on month of the same date columns.
df1:
ID col1 date1
1 1 2018-01-05
1 2 2018-02-05
2 4 2018-04-05
2 1 2018-05-05
3 1 2019-01-05
3 4 2019-02-05
df2:
ID col2 date2
1 1 2018-01-08
1 1 2018-02-08
2 4 2018-04-08
2 3 2018-05-08
3 1 2019-01-08
3 4 2019-02-08
Expected output:
ID col1 date1 col2 date2
1 1 2018-01-05 1 2018-01-08
1 2 2018-02-05 1 2018-02-08
2 4 2018-04-05 4 2018-04-08
2 1 2018-05-05 3 2018-05-08
3 1 2019-01-05 1 2019-01-08
3 4 2019-02-05 4 2019-02-08
I tried something along the lines of:
df = df1.join(df2, (ID & (df1.F.year(date1) == df2.F.year(date2)) & (df1.F.month(date1) == df2.F.month(date2))
How to join on date's month and year?
CodePudding user response:
You can to it like this:
join_on = (df1.ID == df2.ID) & \
(F.year(df1.date1) == F.year(df2.date2)) & \
(F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on)
Full example:
from pyspark.sql import functions as F
df1 = spark.createDataFrame(
[(1, 1, '2018-01-05'),
(1, 2, '2018-02-05'),
(2, 4, '2018-04-05'),
(2, 1, '2018-05-05'),
(3, 1, '2019-01-05'),
(3, 4, '2019-02-05')],
['ID', 'col1', 'date1'])
df2 = spark.createDataFrame(
[(1, 1, '2018-01-08'),
(1, 1, '2018-02-08'),
(2, 4, '2018-04-08'),
(2, 3, '2018-05-08'),
(3, 1, '2019-01-08'),
(3, 4, '2019-02-08')],
['ID', 'col2', 'date2'])
join_on = (df1.ID == df2.ID) & \
(F.year(df1.date1) == F.year(df2.date2)) & \
(F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on).drop(df2.ID)
df.show()
# --- ---- ---------- ---- ----------
# | ID|col1| date1|col2| date2|
# --- ---- ---------- ---- ----------
# | 1| 1|2018-01-05| 1|2018-01-08|
# | 1| 2|2018-02-05| 1|2018-02-08|
# | 2| 4|2018-04-05| 4|2018-04-08|
# | 2| 1|2018-05-05| 3|2018-05-08|
# | 3| 1|2019-01-05| 1|2019-01-08|
# | 3| 4|2019-02-05| 4|2019-02-08|
# --- ---- ---------- ---- ----------