I have a dataframe including columns Date
& Time
, which are local clock readings (as strings). and the column dst
, which indicates whether Daylight Saving Time is active or not using W
for winter and S
for summer.
I know the timezone is Europe/Berlin
, resulting in a 1 hour shift from UTC during wintertime and a 2 hour shift during summertime.
I am very unhappy with this representation and want to convert to aware datetime objects in UTC and only give human readable times when needed.
Date Time dst
27.03.2022 01:15:00 W
27.03.2022 01:30:00 W
27.03.2022 01:45:00 W
27.03.2022 03:00:00 S
27.03.2022 03:15:00 S
27.03.2022 03:30:00 S
27.03.2022 03:45:00 S
27.03.2022 04:00:00 S
27.03.2022 04:15:00 S
27.03.2022 04:30:00 S
27.03.2022 04:45:00 S
27.03.2022 05:00:00 S
27.03.2022 05:15:00 S
My first approach was, retrieving a datetime object using pandas, localize it and subtracting two or one hours depending on the given dst, using numpy.
from datetime import datetime, timedelta, timezone
from dateutil import tz
import numpy as np
import pandas as pd
df['datetime'] = pd.to_datetime(df['Date'] df['Time'], format='%d.%m.%Y%H:%M:%S')
df['datetime_aware'] = df['datetime'].dt.tz_localize(tz='Europe/Berlin')
df['datetime_aware_subtracted'] = np.where(df['dst']=='S', df['datetime_aware']-timedelta(hours=2),
df['datetime_aware']-timedelta(hours=1))
This yields almost correct results, except between 03:00 - 05:00 (in datetime
) the datetime_aware_subtracted
column yields wrong result. One hour too much subtracted one hour timeshift too little. I feel like substracting time over the dst boundary is not a good idea.
datetime datetime_aware datetime_aware_subtracted
27.03.2022 01:15 2022-03-27 01:15:00 01:00 2022-03-27 00:15:00 01:00
27.03.2022 01:30 2022-03-27 01:30:00 01:00 2022-03-27 00:30:00 01:00
27.03.2022 01:45 2022-03-27 01:45:00 01:00 2022-03-27 00:45:00 01:00
27.03.2022 03:00 2022-03-27 03:00:00 02:00 2022-03-27 00:00:00 01:00
27.03.2022 03:15 2022-03-27 03:15:00 02:00 2022-03-27 00:15:00 01:00
27.03.2022 03:30 2022-03-27 03:30:00 02:00 2022-03-27 00:30:00 01:00
27.03.2022 03:45 2022-03-27 03:45:00 02:00 2022-03-27 00:45:00 01:00
27.03.2022 04:00 2022-03-27 04:00:00 02:00 2022-03-27 01:00:00 01:00
27.03.2022 04:15 2022-03-27 04:15:00 02:00 2022-03-27 01:15:00 01:00
27.03.2022 04:30 2022-03-27 04:30:00 02:00 2022-03-27 01:30:00 01:00
27.03.2022 04:45 2022-03-27 04:45:00 02:00 2022-03-27 01:45:00 01:00
27.03.2022 05:00 2022-03-27 05:00:00 02:00 2022-03-27 03:00:00 02:00
27.03.2022 05:15 2022-03-27 05:15:00 02:00 2022-03-27 03:15:00 02:00
My second approach, was to inverse subtraction and localizing.
df['datetime'] = pd.to_datetime(df['Date'] df['Time'], format='%d.%m.%Y%H:%M:%S')
df['datetime_subtracted'] = np.where(df['dst']=='S', df['datetime']-timedelta(hours=2),
df['datetime']-timedelta(hours=1))
df['datetime_subtracted_aware'] = df['datetime_subtracted'].dt.tz_localize(tz='Europe/Berlin')
This gives the correct naive results but after subtracting gives an NonExistentTimeError
when localizing (rightfully so).
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\ProgramData\Miniconda3\envs\env\lib\site-packages\pandas\core\accessor.py", line 94, in f
return self._delegate_method(name, *args, **kwargs)
File "C:\ProgramData\Miniconda3\envs\env\lib\site-packages\pandas\core\indexes\accessors.py", line 123, in _delegate_method
result = method(*args, **kwargs)
File "C:\ProgramData\Miniconda3\envs\env\lib\site-packages\pandas\core\indexes\datetimes.py", line 273, in tz_localize
arr = self._data.tz_localize(tz, ambiguous, nonexistent)
File "C:\ProgramData\Miniconda3\envs\env\lib\site-packages\pandas\core\arrays\_mixins.py", line 84, in method
return meth(self, *args, **kwargs)
File "C:\ProgramData\Miniconda3\envs\env\lib\site-packages\pandas\core\arrays\datetimes.py", line 1043, in tz_localize
new_dates = tzconversion.tz_localize_to_utc(
File "pandas\_libs\tslibs\tzconversion.pyx", line 328, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.NonExistentTimeError: 2022-03-27 02:00:00
What are my best options to aware UTC datetime objects in Europe/Berlin
timezone? In other words, create a new column datetime_aware_localized
that shows the same date&time as the datetime
column but is an aware datetime object in correct timezone.
CodePudding user response:
You can tell tz_localize
how to handle non-existing times:
nonexistentstr, default ‘raise’
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST. Valid values are:
‘shift_forward’ will shift the nonexistent time forward to the closest existing time
‘shift_backward’ will shift the nonexistent time backward to the closest existing time
‘NaT’ will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
‘raise’ will raise an NonExistentTimeError if there are nonexistent times.
Replacing with NaT:
dt = pd.to_datetime(df['Date'] ' ' df['Time'])
corr = pd.to_timedelta(df['dst'].map({'W': 1, 'S': 2}), unit='H')
# converting to NaT
dt.sub(corr).dt.tz_localize(tz='Europe/Berlin', nonexistent='NaT')
output:
0 2022-03-27 00:15:00 01:00
1 2022-03-27 00:30:00 01:00
2 2022-03-27 00:45:00 01:00
3 2022-03-27 01:00:00 01:00
4 2022-03-27 01:15:00 01:00
5 2022-03-27 01:30:00 01:00
6 2022-03-27 01:45:00 01:00
7 NaT
8 NaT
9 NaT
10 NaT
11 2022-03-27 03:00:00 02:00
12 2022-03-27 03:15:00 02:00
dtype: datetime64[ns, Europe/Berlin]
shifting forward:
dt.sub(corr).dt.tz_localize(tz='Europe/Berlin', nonexistent='shift_forward')
0 2022-03-27 00:15:00 01:00
1 2022-03-27 00:30:00 01:00
2 2022-03-27 00:45:00 01:00
3 2022-03-27 01:00:00 01:00
4 2022-03-27 01:15:00 01:00
5 2022-03-27 01:30:00 01:00
6 2022-03-27 01:45:00 01:00
7 2022-03-27 03:00:00 02:00
8 2022-03-27 03:00:00 02:00
9 2022-03-27 03:00:00 02:00
10 2022-03-27 03:00:00 02:00
11 2022-03-27 03:00:00 02:00
12 2022-03-27 03:15:00 02:00
dtype: datetime64[ns, Europe/Berlin]
CodePudding user response:
to clarify, localization works perfectly fine like
pd.to_datetime(df['Date'] df['Time'], format='%d.%m.%Y%H:%M:%S').dt.tz_localize("Europe/Berlin")
0 2022-03-27 01:15:00 01:00
1 2022-03-27 01:30:00 01:00
2 2022-03-27 01:45:00 01:00
3 2022-03-27 03:00:00 02:00
4 2022-03-27 03:15:00 02:00
5 2022-03-27 03:30:00 02:00
6 2022-03-27 03:45:00 02:00
7 2022-03-27 04:00:00 02:00
8 2022-03-27 04:15:00 02:00
9 2022-03-27 04:30:00 02:00
10 2022-03-27 04:45:00 02:00
11 2022-03-27 05:00:00 02:00
12 2022-03-27 05:15:00 02:00
dtype: datetime64[ns, Europe/Berlin]
If you need UTC, use
pd.to_datetime(df['Date'] df['Time'], format='%d.%m.%Y%H:%M:%S').dt.tz_localize("Europe/Berlin").dt.tz_convert("UTC")
0 2022-03-27 00:15:00 00:00
1 2022-03-27 00:30:00 00:00
2 2022-03-27 00:45:00 00:00
3 2022-03-27 01:00:00 00:00
4 2022-03-27 01:15:00 00:00
5 2022-03-27 01:30:00 00:00
6 2022-03-27 01:45:00 00:00
7 2022-03-27 02:00:00 00:00
8 2022-03-27 02:15:00 00:00
9 2022-03-27 02:30:00 00:00
10 2022-03-27 02:45:00 00:00
11 2022-03-27 03:00:00 00:00
12 2022-03-27 03:15:00 00:00
dtype: datetime64[ns, UTC]
regarding NonExistentTimeError
error, you can find an example here, where this is actually a problem and keyword nonexistent
is necessary.