I have the following 2 dataframes:
client | Date |
---|---|
A | 202001 |
A | 202002 |
A | 202105 |
B | 202203 |
B | 202001 |
C | 202205 |
C | 202205 |
C | 202206 |
client | Date |
---|---|
A | 202004 |
B | 202205 |
(The real dataframes contain more dates and clients.)
I now want to add a column to dataframe 1 that evaluates to 1 or 0 based on the following conditions:

- 1 if the client is in the second dataframe AND the dates are at most 4 months apart (in either direction),
- 0 otherwise.
Expected output:
client | Date | condition |
---|---|---|
A | 202001 | 1 |
A | 202002 | 1 |
A | 202105 | 0 |
B | 202203 | 1 |
B | 202001 | 0 |
C | 202205 | 0 |
C | 202205 | 0 |
C | 202206 | 0 |
CodePudding user response:
You could first create a sequence of dates in df2, explode it, and then join.
Dataframes:
from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [('A', 202001),
     ('A', 202002),
     ('A', 202105),
     ('B', 202203),
     ('B', 202001),
     ('C', 202205),
     ('C', 202205),
     ('C', 202206)],
    ['client', 'Date'])
df2 = spark.createDataFrame(
    [('A', 202004),
     ('B', 202205)],
    ['client', 'Date'])
Script:
# Parse the yyyyMM value into a proper date (first day of that month)
df2 = df2.withColumn('Date', F.to_date('Date', 'yyyyMM'))
# Expand every df2 date into one row per month in the ±4-month window
df2 = df2.withColumn('Date', F.explode(F.expr("sequence(add_months(Date, -4), add_months(Date, 4), interval 1 month)")))
# Format back to yyyyMM and tag every generated row with condition = 1
df2 = df2.select('client', F.date_format('Date', 'yyyyMM').alias('Date'), F.lit(1).alias('condition'))
# Left join onto df1; rows with no match get a null condition, filled with 0
df_result = df1.join(df2, ['client', 'Date'], 'left') \
    .fillna(0, 'condition')
df_result.show()
# +------+------+---------+
# |client|  Date|condition|
# +------+------+---------+
# |     A|202002|        1|
# |     A|202001|        1|
# |     B|202203|        1|
# |     A|202105|        0|
# |     C|202206|        0|
# |     C|202205|        0|
# |     C|202205|        0|
# |     B|202001|        0|
# +------+------+---------+
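If you'd rather not materialize the exploded windows, a non-equi join on months_between should give the same result. This is only a sketch, not part of the answer above: it assumes df1 and df2 as originally created (run it before the script reassigns df2), and the names left, right and cond are mine.

from pyspark.sql import functions as F

# Convert both yyyyMM columns to real dates so months_between can compare them
left = df1.withColumn('d', F.to_date(F.col('Date').cast('string'), 'yyyyMM'))
right = (df2.withColumn('d2', F.to_date(F.col('Date').cast('string'), 'yyyyMM'))
            .select(F.col('client').alias('client2'), 'd2'))

# Match when the client appears in df2 and the months are at most 4 apart
# in either direction
cond = (left.client == right.client2) & \
       (F.abs(F.months_between(left.d, right.d2)) <= 4)

df_result = (left.join(right, cond, 'left')
                 .withColumn('condition', F.when(F.col('d2').isNotNull(), 1).otherwise(0))
                 .select('client', 'Date', 'condition'))
df_result.show()

Note that with several df2 dates per client, either approach can duplicate df1 rows when more than one window covers the same month; a .dropDuplicates(['client', 'Date']) on the result (or aggregating df2 first) would guard against that.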