EDIT - Answer as below with the extra code new_df = df_hist[msk]
I'm trying to remove all rows from a DataFrame where the date ('RaceDate') is over 2 years ago. I convert the 'RaceDate' using pd.to_datetime
- this gives the format e.g. 2018-09-01
. I use datetime.datetime.now().date()
and this gives the format datetime.date(2022, 2, 10)
I can't work out how to check the difference is over two years as the code gives me Key error. Any help would be most welcome.
d2 = datetime.datetime.now().date()
df_hist['RaceDate'] = pd.to_datetime(df_hist['RaceDate'])
if df_hist.loc[df_hist['RaceDate']] - d2 > 730:
df_hist.loc[df_hist['RaceDate']] = ""
Sample data:
Jockey Code Course RaceDate Ran FPos TotalBtn
Harry Burns AW Chelmsford City 01-Sep-18 9 4 4.5
Harry Burns AW Newcastle 01-Oct-21 14 6 4.75
Harry Burns AW Kempton 01-Nov-21 11 4 4.25
Harry Burns AW Southwell 01-Jan-22 12 4 3.75
Harry Burns AW Wolverhampton 01-Feb-22 12 5 4.75
Christian Howarth AW Kempton 01-Dec-21 14 5 3.5
Harry Davies AW
Jonathan Fisher AW Southwell 01-Dec-17 14 9 10.5
Jonathan Fisher AW Chelmsford City 01-Dec-17 10 8 4.82
Jonathan Fisher AW Southwell 01-Jan-22 12 5 7.75
Jonathan Fisher AW Southwell 01-Jan-22 14 4 3
Jonathan Fisher AW Wolverhampton 01-Feb-22 7 4 2
Jonathan Fisher AW Wolverhampton 01-Feb-22 11 8 3
CodePudding user response:
You can convert d2
using pd.to_datetime
and subtract (since today's date is greater than yesterday's date, we subtract today's date from the column using rsub
). Then convert the timedelta to year and evaluate if the time difference in years is less than 2:
msk = df_hist['RaceDate'].rsub(pd.to_datetime(d2)).astype('timedelta64[Y]') < 2
races_dates_in_last_2_years = df_hist.loc[msk, 'RaceDate']
or if you wish you filter the entire DataFrame:
new_df = df_hist[msk]
For example:
For the below Series:
s = pd.to_datetime(pd.Series(['2020-02-10', '2020-02-11']))
that looks like:
0 2020-02-10
1 2020-02-11
dtype: datetime64[ns]
the above code produces:
0 2020-02-11
dtype: datetime64[ns]