`combine_first` at DST changing day fails-CodePudding

I am trying to use combine_first to join two pandas series, so that one has precedence over the other. But it fails on the DST change day. I put together this demonstration:

import pandas as pd
import numpy as np

fr1 = pd.date_range(pd.to_datetime('2020-10-25').tz_localize('Europe/Berlin'), pd.to_datetime('2020-10-26').tz_localize('Europe/Berlin'), freq='H')
fr2 = fr1   pd.DateOffset(hours=12)
d1 = pd.Series(data=np.random.randint(0,10, len(fr1)), index = fr1)
d2 = pd.Series(data=np.random.randint(0,10, len(fr2)), index = fr2)
d2.combine_first(d1)

output:

ValueError: cannot reindex from a duplicate axis

Why is this? How can this be fixed? I can of course manually slice the series and concatenate.

CodePudding user response：

As you can see looking at the first items of your DateTimeIndex, the timezone as xxxx changes at the moment of DST change:

>>> date_range = pd.date_range(pd.to_datetime('2020-10-25').tz_localize('Europe/Berlin'), pd.to_datetime('2020-10-26').tz_localize('Europe/Berlin'), freq='H')
>>> date_range[:4]
DatetimeIndex(['2020-10-25 00:00:00 02:00', '2020-10-25 01:00:00 02:00',
               '2020-10-25 02:00:00 02:00', '2020-10-25 02:00:00 01:00'],

When you add the 12 hours, the dates change their timezones too:

DatetimeIndex(['2020-10-25 12:00:00 01:00', '2020-10-25 13:00:00 01:00',
               '2020-10-25 14:00:00 01:00', '2020-10-25 14:00:00 01:00'],
              dtype='datetime64[ns, Europe/Berlin]', freq=None)

However the difference in time is not 12 hours anymore, it is 11 due to the timezone change. That is because with DateOffset you ask for a « apparent » change of 12 hours, as in « the 2 different clock readings are 12 hours apart ». This means that then 2 timestamps are mapped to the same timestamp that appears 12 hours later.

It is a specificity of DateOffset to allow to express non-constant deltas, i.e. if you add « 1 month », you would similarly not add the same number of seconds to a date depending which month you’re in.

If you want a shift in 12 actual hours, use Timedelta:

>>> (date_range[:4]   pd.Timedelta(hours=12))
DatetimeIndex(['2020-10-25 11:00:00 01:00', '2020-10-25 12:00:00 01:00',
               '2020-10-25 13:00:00 01:00', '2020-10-25 14:00:00 01:00'],
              dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> (date_range   pd.Timedelta(hours=12)).is_unique
True

Now that the index is unique, combine_first will also work:

>>> d1 = pd.Series(data=np.random.randint(0, 10, 26), index=date_range)
>>> d2 = pd.Series(data=np.random.randint(0, 10, 26), index=date_range   pd.Timedelta(hours=12))
>>> d2.combine_first(d1)
2020-10-25 00:00:00 02:00    0.0
2020-10-25 01:00:00 02:00    1.0
2020-10-25 02:00:00 02:00    2.0
2020-10-25 02:00:00 01:00    1.0
2020-10-25 03:00:00 01:00    7.0
2020-10-25 04:00:00 01:00    4.0
2020-10-25 05:00:00 01:00    0.0
2020-10-25 06:00:00 01:00    1.0
2020-10-25 07:00:00 01:00    1.0
2020-10-25 08:00:00 01:00    1.0
2020-10-25 09:00:00 01:00    6.0
2020-10-25 10:00:00 01:00    7.0
2020-10-25 11:00:00 01:00    4.0
2020-10-25 12:00:00 01:00    8.0
2020-10-25 13:00:00 01:00    6.0
2020-10-25 14:00:00 01:00    0.0
2020-10-25 15:00:00 01:00    6.0
2020-10-25 16:00:00 01:00    0.0
2020-10-25 17:00:00 01:00    0.0
2020-10-25 18:00:00 01:00    9.0
2020-10-25 19:00:00 01:00    7.0
2020-10-25 20:00:00 01:00    9.0
2020-10-25 21:00:00 01:00    3.0
2020-10-25 22:00:00 01:00    4.0
2020-10-25 23:00:00 01:00    0.0
2020-10-26 00:00:00 01:00    6.0
2020-10-26 01:00:00 01:00    5.0
2020-10-26 02:00:00 01:00    9.0
2020-10-26 03:00:00 01:00    1.0
2020-10-26 04:00:00 01:00    4.0
2020-10-26 05:00:00 01:00    4.0
2020-10-26 06:00:00 01:00    3.0
2020-10-26 07:00:00 01:00    1.0
2020-10-26 08:00:00 01:00    8.0
2020-10-26 09:00:00 01:00    1.0
2020-10-26 10:00:00 01:00    6.0
2020-10-26 11:00:00 01:00    5.0
2020-10-26 12:00:00 01:00    6.0
Freq: H, dtype: float64