Join based on date difference


I have the following 2 dataframes:

client  Date
A       202001
A       202002
A       202105
B       202203
B       202001
C       202205
C       202207
C       202206

client  Date
A       202004
B       202205

(real dataframes contain more dates and clients)

I now want to add a column to dataframe 1 that evaluates to 1 or 0 based on the following conditions:
1 if the client is in the second dataframe AND the dates are at most 4 months apart.
0 otherwise

Expected output:

client  Date    condition
A       202001  1
A       202002  1
A       202105  0
B       202203  1
B       202001  0
C       202205  0
C       202207  0
C       202206  0

CodePudding user response:

You could first create a sequence of dates in df2 (one row per month within ±4 months of its Date) and explode it, then join.

Dataframes:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [('A', 202001),
     ('A', 202002),
     ('A', 202105),
     ('B', 202203),
     ('B', 202001),
     ('C', 202205),
     ('C', 202207),
     ('C', 202206)],
    ['client', 'Date'])
df2 = spark.createDataFrame(
    [('A', 202004),
     ('B', 202205)],
    ['client', 'Date'])

Script:

# parse the yyyyMM values into real dates so month arithmetic works
df2 = df2.withColumn('Date', F.to_date('Date', 'yyyyMM'))
# expand each row into one row per month in the ±4-month window, then format back to yyyyMM
df2 = df2.withColumn('Date', F.explode(F.expr("sequence(add_months(Date, -4), add_months(Date, 4), interval 1 month)")))
df2 = df2.select('client', F.date_format('Date', 'yyyyMM').alias('Date'), F.lit(1).alias('condition'))

# left join: rows with a client+month match get condition=1, the rest get null, which fillna turns into 0
df_result = df1.join(df2, ['client', 'Date'], 'left') \
               .fillna(0, 'condition')

df_result.show()
# +------+------+---------+
# |client|  Date|condition|
# +------+------+---------+
# |     A|202002|        1|
# |     A|202001|        1|
# |     B|202203|        1|
# |     A|202105|        0|
# |     C|202206|        0|
# |     C|202205|        0|
# |     C|202207|        0|
# |     B|202001|        0|
# +------+------+---------+
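
If you'd rather not materialize the exploded month window, a non-equi join on the month difference gives the same flags. A minimal sketch, starting again from the df1 and df2 defined in the Dataframes block (the script above overwrites df2); the names a, b, client2, d1, d2 and df_result2 are only for illustration:

from pyspark.sql import functions as F

# parse the yyyyMM integers into real dates so months_between can compare them
a = df1.withColumn('d1', F.to_date(F.col('Date').cast('string'), 'yyyyMM'))
b = df2.select(
    F.col('client').alias('client2'),
    F.to_date(F.col('Date').cast('string'), 'yyyyMM').alias('d2'))

# left join on client plus "at most 4 months apart"; unmatched rows keep d2 null
df_result2 = (
    a.join(b, (a.client == b.client2)
              & (F.abs(F.months_between(a.d1, b.d2)) <= 4), 'left')
     .withColumn('condition', F.when(F.col('d2').isNotNull(), 1).otherwise(0))
     .select('client', 'Date', 'condition'))

df_result2.show()

With either approach, if df2 ever holds several qualifying rows for the same client, the join duplicates df1 rows, so add a dropDuplicates() if that can happen. A non-equi join like this also tends to execute as a broadcast nested loop join, so the explode-and-equi-join version above usually scales better on large data.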