df_1 =
scope letter date
000 ABC 2022-07-01
100 ABC 2022-08-01
101 DEF 2022-08-02
102 CCC 2022-08-15
103 ABC 2022-08-16
104 ABC 2022-08-31
105 CCC 2022-09-01
106 DEF 2022-09-02
df_2 =
letter date
ABC 2022-08-01
DEF 2022-08-02
CCC 2022-09-02
How do I drop the rows in df_1
for which the date
is later than the date in df_2
? I would expect:
df_1_new =
scope letter date
000 ABC 2022-07-01
100 ABC 2022-08-01
101 DEF 2022-08-02
102 CCC 2022-08-15
105 CCC 2022-09-01
I did a join, but it drops earlier dates too, which I do not want. Is it possible to do a join and then write an if-statement? Or is a join avoidable?
CodePudding user response:
You can use Semi Join
.
df_1_new = df_1.join(
df_2, [df_1["letter"] == df_2["letter"], df_1["date"] <= df_2["date"]], how="semi"
)
CodePudding user response:
Try with this:
df_1['max'] = df_1['letter'].map(df_2.set_index('letter').to_dict()['date'])
df_1_new = df_1[df_1.date <= df_1['max']].drop(columns='max')
CodePudding user response:
Whenever possible, I would avoid using symbols >
, <
, >=
, <=
inside join
conditions, because in some cases it could generate a BroadcastNestedLoopJoin
- a cross join which would drag the performance.
Inputs:
df_1 = spark.createDataFrame(
[('000', 'ABC', '2022-07-01'),
('100', 'ABC', '2022-08-01'),
('101', 'DEF', '2022-08-02'),
('102', 'CCC', '2022-08-15'),
('103', 'ABC', '2022-08-16'),
('104', 'ABC', '2022-08-31'),
('105', 'CCC', '2022-09-01'),
('106', 'DEF', '2022-09-02')],
['scope', 'letter', 'date'])
df_2 = spark.createDataFrame(
[('ABC', '2022-08-01'),
('DEF', '2022-08-02'),
('CCC', '2022-09-02')],
['letter', 'date'])
Script:
df_1_new = (df_1
.join(df_2.withColumnRenamed('date', 'date2'), 'letter', 'left')
.filter("date <= date2")
.select(df_1['*'])
)
df_1_new.show()
# ----- ------ ----------
# |scope|letter| date|
# ----- ------ ----------
# | 000| ABC|2022-07-01|
# | 100| ABC|2022-08-01|
# | 102| CCC|2022-08-15|
# | 105| CCC|2022-09-01|
# | 101| DEF|2022-08-02|
# ----- ------ ----------
CodePudding user response:
So first we make a dictionary of df_2
for easy/fast comparison
d = dict(zip(df_2.letter, df_2.date))
d
{
"ABC": Timestamp("2022-08-01 00:00:00"),
"DEF": Timestamp("2022-08-02 00:00:00"),
"CCC": Timestamp("2022-09-02 00:00:00"),
}
Next we select rows from df_1
that are confirming to the condition
df_1[df_1.apply(lambda x: False if x.date < d[x.letter] else True, axis=1)]
scope letter date
1 100 ABC 2022-08-01
2 101 DEF 2022-08-02
4 103 ABC 2022-08-16
5 104 ABC 2022-08-31
7 106 DEF 2022-09-02