Extract a subset given two dates from a python dataframe with timezone date format-CodePudding

I have the following dataframe:

|      ID             |     date                       |
|---------------------|--------------------------------|
|          1          |     2022-02-03 22:01:12 01:00  |
|          2          |     2022-02-04 21:11:21 01:00  |
|          3          |     2022-02-05 11:11:21 01:00  |
|          4          |     2022-02-07 23:01:12 01:00  |
|          5          |     2022-02-07 14:31:14 02:00  |
|          6          |     2022-02-08 18:12:01 02:00  |
|          7          |     2022-02-09 20:21:02 02:00  |
|          8          |     2022-02-11 15:41:25 02:00  |
|          9          |     2022-02-15 11:21:27 02:00  |

I have made a function that, given two dates with the following format (YYYYY-MM-DD HH:MM:SS), obtains the subset of data between that interval. The code is as follows:

# Selects a subset of the dataset from a given time interval
def select_interval(df, start_date, end_date):
    # Confirm the given format and convert to datetime
    start_date = pd.to_datetime(start_date, format='%Y-%m-%d %H:%M:%S')
    end_date = pd.to_datetime(end_date, format='%Y-%m-%d %H:%M:%S')
    # Create a copy of the original df
    subset = df.copy()
    # Creates a temporary column to store the values related to the specific date
    subset['tmp_date'] = subset['date'].apply(lambda a: pd.to_datetime(str(a.date())   " "   str(a.time())))
    if start_date < end_date:
        mask = (subset['tmp_date'] >= start_date) & (subset['tmp_date'] <= end_date)
        df = df.loc[mask]
    
    return df

I need to make the additional column constructed from the date and time because if I directly compare the dates passed by parameter with the values of the date column (which contain the timezone) it gives the following error: TypeError: can't compare offset-naive and offset-aware datetimes

I would like to know if there is a way to solve this problem in a more optimal way, because I think that creating the tmp_date column makes my function less efficient. Thank you for your help.

CodePudding user response：

You can change the start_date & end_date to timezone aware before passing the parameter to the function as below.

import pytz
start_date = pytz.utc.localize(start_date)
end_date = pytz.utc.localize(end_date)

CodePudding user response：

`pd.to_datetime` with `utc=True`

You can pass the optional parameter utc=True to pd.to_datetime function in order to convert the timezone-aware inputs to UTC. Then you should be able to compare the date column with start_date and end_date in order to subset the df

Here is the step by step example,

print(df)

   ID                       date
0   1  2022-02-03 22:01:12 01:00
1   2  2022-02-04 21:11:21 01:00
2   3  2022-02-05 11:11:21 01:00
3   4  2022-02-07 23:01:12 01:00
4   5  2022-02-07 14:31:14 02:00
5   6  2022-02-08 18:12:01 02:00
6   7  2022-02-09 20:21:02 02:00
7   8  2022-02-11 15:41:25 02:00
8   9  2022-02-15 11:21:27 02:00

# Convert to UTC
df['date'] = pd.to_datetime(df['date'], utc=True)
print(df)

   ID                      date
0   1 2022-02-03 21:01:12 00:00
1   2 2022-02-04 20:11:21 00:00
2   3 2022-02-05 10:11:21 00:00
3   4 2022-02-07 22:01:12 00:00
4   5 2022-02-07 12:31:14 00:00
5   6 2022-02-08 16:12:01 00:00
6   7 2022-02-09 18:21:02 00:00
7   8 2022-02-11 13:41:25 00:00
8   9 2022-02-15 09:21:27 00:00

# Filter the rows with boolean indexing
subset = df[df['date'].between('2022-02-03 21:01:12', '2022-02-07 22:01:11')]
print(subset)

   ID                      date
0   1 2022-02-03 21:01:12 00:00
1   2 2022-02-04 20:11:21 00:00
2   3 2022-02-05 10:11:21 00:00
4   5 2022-02-07 12:31:14 00:00

pd.to_datetime with utc=True

`pd.to_datetime` with `utc=True`