I have a vector of dates
L = [Timestamp('2018-07-15 00:00:00'),
Timestamp('2019-07-15 00:00:00')]
and a dataframe with a date column:
df = c1 c2 Date
1. 2. 2018-07-13 16:00:00
1. 7. 2018-07-15 16:00:00
3. 7. 2018-07-15 16:50:00
4. 7. 2018-07-15 19:50:00
2. 2. 2018-07-16 16:00:00
5. 1. 2020-10-10 16:00:00
8. 4. 2018-06-13 16:00:00
5. 4. 2021-12-13 16:00:00
2. 9. 2019-01-01 16:00:00
2. 6. 2019-01-01 17:00:00
I want to add a column that is the rank in the dates vector, and delete rows from the same date (regardless of hour/minutes). So I will have:
df = c1 c2 d. new
1. 2. 2018-07-13 16:00:00 0
1. 4. 2018-06-13 16:00:00 0
2. 2. 2018-07-16 16:00:00 1
5. 1. 2020-10-10 16:00:00 2
8. 4. 2018-06-13 16:00:00 0
5. 4. 2021-12-13 16:00:00 2
2. 9. 2019-01-01 16:00:00 1
2. 6. 2019-01-01 17:00:00. 1
What is the best way to do it?
CodePudding user response:
You can compare 2 numpy arrays with broadcasting, for count True
s use sum
, last filter out matched dates without times with Series.dt.normalize
and Series.isin
with invert mask by ~
:
L = [pd.Timestamp('2018-07-15 00:00:00'),pd.Timestamp('2019-07-15 00:00:00')]
df['Date'] = pd.to_datetime(df['Date'])
df['new'] = (df['Date'].to_numpy()[:, None] > pd.to_datetime(L).to_numpy()).sum(axis=1)
df1 = df[~df['Date'].dt.normalize().isin(L)]
print (df1)
c1 c2 Date new
0 1.0 2.0 2018-07-13 16:00:00 0
4 2.0 2.0 2018-07-16 16:00:00 1
5 5.0 1.0 2020-10-10 16:00:00 2
6 8.0 4.0 2018-06-13 16:00:00 0
7 5.0 4.0 2021-12-13 16:00:00 2
8 2.0 9.0 2019-01-01 16:00:00 1
9 2.0 6.0 2019-01-01 17:00:00 1