Filtering dataframe dates based on dates in another dataframe


df_1 = 

scope      letter            date
  000         ABC      2022-07-01
  100         ABC      2022-08-01
  101         DEF      2022-08-02
  102         CCC      2022-08-15
  103         ABC      2022-08-16
  104         ABC      2022-08-31
  105         CCC      2022-09-01
  106         DEF      2022-09-02

df_2 =

  letter               date
     ABC         2022-08-01
     DEF         2022-08-02
     CCC         2022-09-02

How do I drop the rows in df_1 whose date is later than the date for the matching letter in df_2? I would expect:

df_1_new = 

scope      letter            date
  000         ABC      2022-07-01
  100         ABC      2022-08-01
  101         DEF      2022-08-02
  102         CCC      2022-08-15
  105         CCC      2022-09-01

I tried a join, but it also drops the rows with earlier dates, which I want to keep. Is it possible to do a join and then filter with a condition? Or can the join be avoided altogether?

CodePudding user response:

You can use a semi join: it keeps only the rows of df_1 that have at least one matching row in df_2 under the join condition, without adding any columns from df_2.

# PySpark left semi join: keep df_1 rows with a match; df_2's columns are not returned
df_1_new = df_1.join(
    df_2, [df_1["letter"] == df_2["letter"], df_1["date"] <= df_2["date"]], how="semi"
)
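If you are working in pandas rather than PySpark, the same semi-join effect can be sketched as a merge followed by a filter (a sketch, assuming df_2 has exactly one date per letter; the date_cutoff column name is mine, not from the question):

```python
import pandas as pd

# Frames reconstructed from the question
df_1 = pd.DataFrame({
    'scope': ['000', '100', '101', '102', '103', '104', '105', '106'],
    'letter': ['ABC', 'ABC', 'DEF', 'CCC', 'ABC', 'ABC', 'CCC', 'DEF'],
    'date': pd.to_datetime(['2022-07-01', '2022-08-01', '2022-08-02',
                            '2022-08-15', '2022-08-16', '2022-08-31',
                            '2022-09-01', '2022-09-02']),
})
df_2 = pd.DataFrame({
    'letter': ['ABC', 'DEF', 'CCC'],
    'date': pd.to_datetime(['2022-08-01', '2022-08-02', '2022-09-02']),
})

# The merge brings in each letter's cutoff date; the filter keeps rows on or before it
merged = df_1.merge(df_2.rename(columns={'date': 'date_cutoff'}), on='letter')
df_1_new = merged[merged['date'] <= merged['date_cutoff']].drop(columns='date_cutoff')
```

Because df_2 has one row per letter, the inner merge never duplicates rows of df_1, so the result matches what the semi join would return.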

CodePudding user response:

Try this:

# Map each letter to its cutoff date from df_2, then keep only rows on or before it
df_1['max'] = df_1['letter'].map(df_2.set_index('letter').to_dict()['date'])
df_1_new = df_1[df_1.date <= df_1['max']].drop(columns='max')

CodePudding user response:

Whenever possible, I would avoid using the comparison operators >, <, >=, <= inside join conditions, because in some cases they can force a BroadcastNestedLoopJoin - effectively a cross join that drags down performance. Instead, join on the equality condition alone and apply the inequality as a filter afterwards.

Inputs:

df_1 = spark.createDataFrame(
    [('000', 'ABC', '2022-07-01'),
     ('100', 'ABC', '2022-08-01'),
     ('101', 'DEF', '2022-08-02'),
     ('102', 'CCC', '2022-08-15'),
     ('103', 'ABC', '2022-08-16'),
     ('104', 'ABC', '2022-08-31'),
     ('105', 'CCC', '2022-09-01'),
     ('106', 'DEF', '2022-09-02')],
    ['scope', 'letter', 'date'])
df_2 = spark.createDataFrame(
    [('ABC', '2022-08-01'),
     ('DEF', '2022-08-02'),
     ('CCC', '2022-09-02')],
    ['letter', 'date'])

Script:

df_1_new = (df_1
    .join(df_2.withColumnRenamed('date', 'date2'), 'letter', 'left')  # equi-join only
    .filter("date <= date2")  # inequality applied as a filter, not in the join
    .select(df_1['*'])        # keep only df_1's original columns
)

df_1_new.show()
# +-----+------+----------+
# |scope|letter|      date|
# +-----+------+----------+
# |  000|   ABC|2022-07-01|
# |  100|   ABC|2022-08-01|
# |  102|   CCC|2022-08-15|
# |  105|   CCC|2022-09-01|
# |  101|   DEF|2022-08-02|
# +-----+------+----------+

CodePudding user response:

First we build a dictionary from df_2 for easy/fast comparison:

d = dict(zip(df_2.letter, df_2.date))
d

{
    "ABC": Timestamp("2022-08-01 00:00:00"),
    "DEF": Timestamp("2022-08-02 00:00:00"),
    "CCC": Timestamp("2022-09-02 00:00:00"),
}

Next we select the rows from df_1 that conform to the condition - a row is kept when its date is on or before the cutoff date for its letter:

df_1[df_1.apply(lambda x: x.date <= d[x.letter], axis=1)]
    scope   letter  date
0   000     ABC     2022-07-01
1   100     ABC     2022-08-01
2   101     DEF     2022-08-02
3   102     CCC     2022-08-15
6   105     CCC     2022-09-01
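To keep only the rows on or before each letter's cutoff date, as in the question's expected output, the same check can also be vectorized with map, avoiding the per-row lambda (a sketch; the frame and dictionary are reconstructed from the question):

```python
import pandas as pd

# Reconstructed input frame and cutoff dictionary from the question
df_1 = pd.DataFrame({
    'scope': ['000', '100', '101', '102', '103', '104', '105', '106'],
    'letter': ['ABC', 'ABC', 'DEF', 'CCC', 'ABC', 'ABC', 'CCC', 'DEF'],
    'date': pd.to_datetime(['2022-07-01', '2022-08-01', '2022-08-02',
                            '2022-08-15', '2022-08-16', '2022-08-31',
                            '2022-09-01', '2022-09-02']),
})
d = {'ABC': pd.Timestamp('2022-08-01'),
     'DEF': pd.Timestamp('2022-08-02'),
     'CCC': pd.Timestamp('2022-09-02')}

# Map every letter to its cutoff in one pass, then compare whole columns
df_1_new = df_1[df_1['date'] <= df_1['letter'].map(d)]
```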