Join based on date difference


I have the following 2 dataframes:

client  Date
A       202001
A       202002
A       202105
B       202203
B       202001
C       202205
C       202207
C       202206

client  Date
A       202004
B       202205

(real dataframes contain more dates and clients)

I now want to add a column to dataframe 1 that evaluates to 1 or 0 based on the following conditions:
1 if the client is in the second dataframe AND the dates are at most 4 months apart.
0 otherwise

Expected output:

client  Date    condition
A       202001  1
A       202002  1
A       202105  0
B       202203  1
B       202001  0
C       202205  0
C       202207  0
C       202206  0

CodePudding user response:

You could first create a sequence of dates in df2 (one row per month within ±4 months of its Date) and explode it, then join.

Dataframes:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [('A', 202001),
     ('A', 202002),
     ('A', 202105),
     ('B', 202203),
     ('B', 202001),
     ('C', 202205),
     ('C', 202207),
     ('C', 202206)],
    ['client', 'Date'])
df2 = spark.createDataFrame(
    [('A', 202004),
     ('B', 202205)],
    ['client', 'Date'])

Script:

# parse the yyyyMM values into real dates so month arithmetic works
df2 = df2.withColumn('Date', F.to_date('Date', 'yyyyMM'))
# expand each row into one row per month in the ±4-month window, then format back to yyyyMM
df2 = df2.withColumn('Date', F.explode(F.expr("sequence(add_months(Date, -4), add_months(Date, 4), interval 1 month)")))
df2 = df2.select('client', F.date_format('Date', 'yyyyMM').alias('Date'), F.lit(1).alias('condition'))

# left join: rows with a client+month match get condition=1, the rest get null, which fillna turns into 0
df_result = df1.join(df2, ['client', 'Date'], 'left') \
               .fillna(0, 'condition')

df_result.show()
# +------+------+---------+
# |client|  Date|condition|
# +------+------+---------+
# |     A|202002|        1|
# |     A|202001|        1|
# |     B|202203|        1|
# |     A|202105|        0|
# |     C|202206|        0|
# |     C|202205|        0|
# |     C|202207|        0|
# |     B|202001|        0|
# +------+------+---------+
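
If you'd rather not materialize the exploded month window, a non-equi join on the month difference gives the same flags. A minimal sketch, starting again from the df1 and df2 defined in the Dataframes block (the script above overwrites df2); the names a, b, client2, d1, d2 and df_result2 are only for illustration:

from pyspark.sql import functions as F

# parse the yyyyMM integers into real dates so months_between can compare them
a = df1.withColumn('d1', F.to_date(F.col('Date').cast('string'), 'yyyyMM'))
b = df2.select(
    F.col('client').alias('client2'),
    F.to_date(F.col('Date').cast('string'), 'yyyyMM').alias('d2'))

# left join on client plus "at most 4 months apart"; unmatched rows keep d2 null
df_result2 = (
    a.join(b, (a.client == b.client2)
              & (F.abs(F.months_between(a.d1, b.d2)) <= 4), 'left')
     .withColumn('condition', F.when(F.col('d2').isNotNull(), 1).otherwise(0))
     .select('client', 'Date', 'condition'))

df_result2.show()

With either approach, if df2 ever holds several qualifying rows for the same client, the join duplicates df1 rows, so add a dropDuplicates() if that can happen. A non-equi join like this also tends to execute as a broadcast nested loop join, so the explode-and-equi-join version above usually scales better on large data.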