I need to filter rows in a dataframe, based on the value of the "date" column. This is a sample of such a column:
2022-09-08 09:00:00-05
2022-09-08 12:00:00-05
2022-09-08 15:00:00-05
2022-09-08 18:00:00-05
2022-09-09 09:00:00-05
2022-09-09 12:00:00-05
2022-09-09 15:00:00-05
2022-09-09 18:00:00-05
2022-09-10 09:00:00-05
2022-09-10 12:00:00-05
2022-09-10 15:00:00-05
Dtype is set to:
datetime64[ns, UTC-05:00]
The "end_date" is the maximum value of the "date" column, and the "start_date" is the "end_date" minus 14 days. I don't need the time part, so I removed it with .date():
data_df = pd.DataFrame(data=data, columns=columns)
end_date = pd.to_datetime(data_df['date'].max().date())
start_date = pd.to_datetime(end_date - timedelta(days=14))
mask = (data_df['date'] > start_date) & (data_df['date'] <= end_date)
filtered_df = data_df.loc[mask]
When I run the code, I get the following error:
TypeError: Cannot compare tz-naive and tz-aware datetime-like objects
TypeError: Invalid comparison between dtype=datetime64[ns, UTC-05:00] and Timestamp
How can I fix this?
CodePudding user response:
# Convert the whole column to datetime [If necessary]
# df.date = pd.to_datetime(df.date)
# Get max date
max_date = df.date.max().date()
# Here I use '1d' (1 day) for an interesting result.
# You'll use '14d' instead.
mask = df.date.dt.date.between(max_date - pd.to_timedelta('1d'), max_date)
f_df = df.loc[mask]
print(f_df)
# Output:
date
4 2022-09-09 09:00:00-05:00
5 2022-09-09 12:00:00-05:00
6 2022-09-09 15:00:00-05:00
7 2022-09-09 18:00:00-05:00
8 2022-09-10 09:00:00-05:00
9 2022-09-10 12:00:00-05:00
10 2022-09-10 15:00:00-05:00
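If you would rather keep comparing full timestamps instead of dates, another option is to leave the bounds timezone-aware. The following is only a sketch under assumptions: the sample rows are made up to mirror the column in the question, and the exclusive upper bound at the next midnight is my choice so that rows on the last day are not dropped.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's column (UTC-05:00).
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-08-20 09:00:00-05:00",
        "2022-09-08 09:00:00-05:00",
        "2022-09-10 09:00:00-05:00",
        "2022-09-10 15:00:00-05:00",
    ])
})

# normalize() resets the time to midnight but keeps the timezone,
# so the bounds stay comparable with the tz-aware column.
end_day = df["date"].max().normalize()
start_date = end_day - pd.Timedelta(days=14)
# Exclusive upper bound at the next midnight keeps rows from the last day.
end_date = end_day + pd.Timedelta(days=1)

mask = (df["date"] > start_date) & (df["date"] < end_date)
filtered_df = df.loc[mask]
print(filtered_df)
```

Unlike the .dt.date version, this compares instants, so a row at any time after midnight on the start day would also pass the lower bound.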
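TL;DR: Both places where you use data_df['date'] in the mask should be data_df['date'].dt.date instead, if you want to keep your current approach. The bounds then need to be plain date objects as well, so drop the pd.to_datetime() calls around end_date and start_date; recent pandas versions warn about or reject comparing a date with a Timestamp.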
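Applied to the code from the question, that could look roughly like this. It is a sketch, not a definitive fix: the `data` and `columns` values are hypothetical stand-ins for yours, and it assumes the bounds are kept as plain datetime.date objects so that both sides of the comparison have the same type.

```python
from datetime import timedelta

import pandas as pd

# Hypothetical sample data standing in for the question's `data` and `columns`.
data = [
    ["2022-08-20 09:00:00-05:00"],
    ["2022-09-08 09:00:00-05:00"],
    ["2022-09-10 15:00:00-05:00"],
]
columns = ["date"]

data_df = pd.DataFrame(data=data, columns=columns)
data_df["date"] = pd.to_datetime(data_df["date"])

# Keep the bounds as plain datetime.date objects (no pd.to_datetime here),
# so they match what .dt.date produces.
end_date = data_df["date"].max().date()
start_date = end_date - timedelta(days=14)

# Compare calendar dates on both sides; no timezone is involved any more.
dates = data_df["date"].dt.date
mask = (dates > start_date) & (dates <= end_date)
filtered_df = data_df.loc[mask]
print(filtered_df)
```

The original column is left untouched, so filtered_df still carries the full tz-aware timestamps.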