Home > Mobile >  PySpark join on ID then on year and month from 'date' column
PySpark join on ID then on year and month from 'date' column

Time:10-19

I have 2 PySpark dataframes and want to join on "ID", then on a year from "date1" and "date2" columns and then on month of the same date columns.

df1:

ID   col1     date1
1    1     2018-01-05
1    2     2018-02-05
2    4     2018-04-05
2    1     2018-05-05
3    1     2019-01-05
3    4     2019-02-05

df2:

ID   col2     date2
1    1     2018-01-08
1    1     2018-02-08
2    4     2018-04-08
2    3     2018-05-08
3    1     2019-01-08
3    4     2019-02-08

Expected output:

ID   col1     date1     col2     date2
1    1     2018-01-05    1     2018-01-08
1    2     2018-02-05    1     2018-02-08
2    4     2018-04-05    4     2018-04-08
2    1     2018-05-05    3     2018-05-08
3    1     2019-01-05    1     2019-01-08
3    4     2019-02-05    4     2019-02-08

I tried something along the lines of:

df = df1.join(df2, (ID & (df1.F.year(date1) == df2.F.year(date2)) & (df1.F.month(date1) == df2.F.month(date2))

How to join on date's month and year?

CodePudding user response:

You can to it like this:

join_on = (df1.ID == df2.ID) & \
          (F.year(df1.date1) == F.year(df2.date2)) & \
          (F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on)

Full example:

from pyspark.sql import functions as F
df1 = spark.createDataFrame(
    [(1, 1, '2018-01-05'),
     (1, 2, '2018-02-05'),
     (2, 4, '2018-04-05'),
     (2, 1, '2018-05-05'),
     (3, 1, '2019-01-05'),
     (3, 4, '2019-02-05')],
    ['ID', 'col1', 'date1'])
df2 = spark.createDataFrame(
    [(1, 1, '2018-01-08'),
     (1, 1, '2018-02-08'),
     (2, 4, '2018-04-08'),
     (2, 3, '2018-05-08'),
     (3, 1, '2019-01-08'),
     (3, 4, '2019-02-08')],
    ['ID', 'col2', 'date2'])

join_on = (df1.ID == df2.ID) & \
          (F.year(df1.date1) == F.year(df2.date2)) & \
          (F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on).drop(df2.ID)

df.show()
#  --- ---- ---------- ---- ---------- 
# | ID|col1|     date1|col2|     date2|
#  --- ---- ---------- ---- ---------- 
# |  1|   1|2018-01-05|   1|2018-01-08|
# |  1|   2|2018-02-05|   1|2018-02-08|
# |  2|   4|2018-04-05|   4|2018-04-08|
# |  2|   1|2018-05-05|   3|2018-05-08|
# |  3|   1|2019-01-05|   1|2019-01-08|
# |  3|   4|2019-02-05|   4|2019-02-08|
#  --- ---- ---------- ---- ---------- 
  • Related