I have the following dataframe:
| ID | date |
|---------------------|--------------------------------|
| 1 | 2022-02-03 22:01:12 01:00 |
| 2 | 2022-02-04 21:11:21 01:00 |
| 3 | 2022-02-05 11:11:21 01:00 |
| 4 | 2022-02-07 23:01:12 01:00 |
| 5 | 2022-02-07 14:31:14 02:00 |
| 6 | 2022-02-08 18:12:01 02:00 |
| 7 | 2022-02-09 20:21:02 02:00 |
| 8 | 2022-02-11 15:41:25 02:00 |
| 9 | 2022-02-15 11:21:27 02:00 |
I have made a function that, given two dates with the following format (YYYYY-MM-DD HH:MM:SS), obtains the subset of data between that interval. The code is as follows:
# Selects a subset of the dataset from a given time interval
def select_interval(df, start_date, end_date):
# Confirm the given format and convert to datetime
start_date = pd.to_datetime(start_date, format='%Y-%m-%d %H:%M:%S')
end_date = pd.to_datetime(end_date, format='%Y-%m-%d %H:%M:%S')
# Create a copy of the original df
subset = df.copy()
# Creates a temporary column to store the values related to the specific date
subset['tmp_date'] = subset['date'].apply(lambda a: pd.to_datetime(str(a.date()) " " str(a.time())))
if start_date < end_date:
mask = (subset['tmp_date'] >= start_date) & (subset['tmp_date'] <= end_date)
df = df.loc[mask]
return df
I need to make the additional column constructed from the date and time because if I directly compare the dates passed by parameter with the values of the date column (which contain the timezone) it gives the following error: TypeError: can't compare offset-naive and offset-aware datetimes
I would like to know if there is a way to solve this problem in a more optimal way, because I think that creating the tmp_date
column makes my function less efficient. Thank you for your help.
CodePudding user response:
You can change the start_date
& end_date
to timezone aware before passing the parameter to the function as below.
import pytz
start_date = pytz.utc.localize(start_date)
end_date = pytz.utc.localize(end_date)
CodePudding user response:
pd.to_datetime
with utc=True
You can pass the optional parameter utc=True
to pd.to_datetime
function in order to convert the timezone-aware inputs to UTC. Then you should be able to compare the date column with start_date
and end_date
in order to subset the df
Here is the step by step example,
print(df)
ID date
0 1 2022-02-03 22:01:12 01:00
1 2 2022-02-04 21:11:21 01:00
2 3 2022-02-05 11:11:21 01:00
3 4 2022-02-07 23:01:12 01:00
4 5 2022-02-07 14:31:14 02:00
5 6 2022-02-08 18:12:01 02:00
6 7 2022-02-09 20:21:02 02:00
7 8 2022-02-11 15:41:25 02:00
8 9 2022-02-15 11:21:27 02:00
# Convert to UTC
df['date'] = pd.to_datetime(df['date'], utc=True)
print(df)
ID date
0 1 2022-02-03 21:01:12 00:00
1 2 2022-02-04 20:11:21 00:00
2 3 2022-02-05 10:11:21 00:00
3 4 2022-02-07 22:01:12 00:00
4 5 2022-02-07 12:31:14 00:00
5 6 2022-02-08 16:12:01 00:00
6 7 2022-02-09 18:21:02 00:00
7 8 2022-02-11 13:41:25 00:00
8 9 2022-02-15 09:21:27 00:00
# Filter the rows with boolean indexing
subset = df[df['date'].between('2022-02-03 21:01:12', '2022-02-07 22:01:11')]
print(subset)
ID date
0 1 2022-02-03 21:01:12 00:00
1 2 2022-02-04 20:11:21 00:00
2 3 2022-02-05 10:11:21 00:00
4 5 2022-02-07 12:31:14 00:00