pandas drop tables with condition and converting value to int-CodePudding

I have a pandas dataframe and want to drop all rows with a start date smaller than 2019 and greater than 2020. For sure I can just iterate over it, do the condition, and drop it by index if it is False. For example like the following:

for index, row in df.iterrows():
    # extract year from date format YYYY-MM-DD
    year = int(row['START_DATE'][:4])

    # remove all dates before and after 2019/2020
    if not (year >= 2019 and year <= 2020):
        df = df.drop(index)

But my goal is to write code more effectively. And that is the point where I am stuck. I came to the following line:

df = df.drop(df[(int(df.START_DATE[:4]) < 2019) & (int(df.START_DATE[:4]) > 2020)].index)

but I get a TypeError: cannot convert the series to <class 'int'> and don't know how to convert the values to an int in this short statement.

CodePudding user response：

Use pd.to_datetime to check if the date is between your range then extract the year:

>>> df
   START_DATE  VAL
0  2018-12-31    1
1  2019-12-31    2
2  2020-12-31    3
3  2021-12-31    4


>>> df.loc[pd.to_datetime(df['START_DATE']).between('2019', '2021')] \
      .assign(START_DATE=df['START_DATE'].str[:4].astype(int))

   START_DATE  VAL
1        2019    2
2        2020    3

CodePudding user response：

First ensure that START_DATE column is in pd.datetime. Then filter them by your condition. ~ is a NOT operation in Pandas.

df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df = df[~((df["START_DATE"].dt.year < 2019) | (df["START_DATE"].dt.year > 2020))]