Home > OS >  Python: remove all rows from a dataframe where the date is over 2 years ago
Python: remove all rows from a dataframe where the date is over 2 years ago

Time:02-11

EDIT - Answer as below with the extra code new_df = df_hist[msk]

I'm trying to remove all rows from a DataFrame where the date ('RaceDate') is over 2 years ago. I convert the 'RaceDate' using pd.to_datetime - this gives the format e.g. 2018-09-01. I use datetime.datetime.now().date() and this gives the format datetime.date(2022, 2, 10)

I can't work out how to check the difference is over two years as the code gives me Key error. Any help would be most welcome.

d2 = datetime.datetime.now().date()
df_hist['RaceDate'] = pd.to_datetime(df_hist['RaceDate'])   

if  df_hist.loc[df_hist['RaceDate']] - d2 > 730:
    df_hist.loc[df_hist['RaceDate']] = ""

Sample data:

Jockey  Code    Course  RaceDate    Ran FPos    TotalBtn
Harry Burns AW  Chelmsford City 01-Sep-18   9   4   4.5
Harry Burns AW  Newcastle   01-Oct-21   14  6   4.75
Harry Burns AW  Kempton 01-Nov-21   11  4   4.25
Harry Burns AW  Southwell   01-Jan-22   12  4   3.75
Harry Burns AW  Wolverhampton   01-Feb-22   12  5   4.75
Christian Howarth   AW  Kempton 01-Dec-21   14  5   3.5
Harry Davies    AW                  
Jonathan Fisher AW  Southwell   01-Dec-17   14  9   10.5
Jonathan Fisher AW  Chelmsford City 01-Dec-17   10  8   4.82
Jonathan Fisher AW  Southwell   01-Jan-22   12  5   7.75
Jonathan Fisher AW  Southwell   01-Jan-22   14  4   3
Jonathan Fisher AW  Wolverhampton   01-Feb-22   7   4   2
Jonathan Fisher AW  Wolverhampton   01-Feb-22   11  8   3

CodePudding user response:

You can convert d2 using pd.to_datetime and subtract (since today's date is greater than yesterday's date, we subtract today's date from the column using rsub). Then convert the timedelta to year and evaluate if the time difference in years is less than 2:

msk = df_hist['RaceDate'].rsub(pd.to_datetime(d2)).astype('timedelta64[Y]') < 2
races_dates_in_last_2_years = df_hist.loc[msk, 'RaceDate']

or if you wish you filter the entire DataFrame:

new_df = df_hist[msk]

For example:

For the below Series:

s = pd.to_datetime(pd.Series(['2020-02-10', '2020-02-11']))

that looks like:

0   2020-02-10
1   2020-02-11
dtype: datetime64[ns]

the above code produces:

0   2020-02-11
dtype: datetime64[ns]
  • Related