I have a pandas dataframe and want to drop all rows with a start date smaller than 2019 and greater than 2020. For sure I can just iterate over it, do the condition, and drop it by index if it is False. For example like the following:
for index, row in df.iterrows():
# extract year from date format YYYY-MM-DD
year = int(row['START_DATE'][:4])
# remove all dates before and after 2019/2020
if not (year >= 2019 and year <= 2020):
df = df.drop(index)
But my goal is to write code more effectively. And that is the point where I am stuck. I came to the following line:
df = df.drop(df[(int(df.START_DATE[:4]) < 2019) & (int(df.START_DATE[:4]) > 2020)].index)
but I get a TypeError: cannot convert the series to <class 'int'> and don't know how to convert the values to an int in this short statement.
CodePudding user response:
Use pd.to_datetime
to check if the date is between your range then extract the year:
>>> df
START_DATE VAL
0 2018-12-31 1
1 2019-12-31 2
2 2020-12-31 3
3 2021-12-31 4
>>> df.loc[pd.to_datetime(df['START_DATE']).between('2019', '2021')] \
.assign(START_DATE=df['START_DATE'].str[:4].astype(int))
START_DATE VAL
1 2019 2
2 2020 3
CodePudding user response:
First ensure that START_DATE
column is in pd.datetime
. Then filter them by your condition. ~
is a NOT
operation in Pandas
.
df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df = df[~((df["START_DATE"].dt.year < 2019) | (df["START_DATE"].dt.year > 2020))]