I have two time series files which are meant to be in CET/CEST. One of them (the bad one) does not write the values properly. For the good csv, see here:
# test_good.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7224
2017-10-29 02:00,7225
2017-10-29 02:00,7226
2017-10-29 03:00,7227
...
...everything works fine by using:
df['utc_time'] = (pd.to_datetime(df[local_time_column])
                  .dt.tz_localize('CET', ambiguous="infer")
                  .dt.tz_convert('UTC')
                  .dt.strftime('%Y-%m-%d %H:%M:%S'))
When converting test_bad.csv to UTC I get an AmbiguousTimeError, as two hours are missing in October (the repeated 02:00 and the 03:00).
# test_bad.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017 # everything is as it should be
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7223
2017-10-29 02:00,7224 # the value of 2 am should actually be repeated PLUS 3 am is missing
2017-10-29 04:00,7226
2017-10-29 05:00,7227
...
Does anyone know an elegant way to still convert the time series file to UTC and add NaN values for the missing dates in the new index? Thanks for your help.
CodePudding user response:
Elaborating a bit on Mark Ransom's comment:
2017-10-29 02:00,7224
is ambiguous; it could be 2017-10-29 00:00 UTC or 2017-10-29 01:00 UTC. That's why tz_localize with ambiguous="infer" refuses to infer anything.
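To illustrate with a minimal sketch (the timestamp is taken from the question, the rest is just for demonstration): with ambiguous="infer", pandas needs the hour to appear twice in the series to decide, and since it appears only once in test_bad.csv it raises AmbiguousTimeError instead. Localizing a single Timestamp shows the two candidate UTC instants:
import pandas as pd

t = pd.Timestamp('2017-10-29 02:00')
# ambiguous=True -> treat as DST (CEST, UTC+2), i.e. the first 2 am
print(t.tz_localize('Europe/Berlin', ambiguous=True).tz_convert('UTC'))   # 2017-10-29 00:00:00+00:00
# ambiguous=False -> treat as standard time (CET, UTC+1), i.e. the second 2 am
print(t.tz_localize('Europe/Berlin', ambiguous=False).tz_convert('UTC'))  # 2017-10-29 01:00:00+00:00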
With the help of some native Python, you can work around this. Assuming you just loaded the csv to a df without parsing anything to datetime, you can continue like this:
from datetime import datetime
import pytz
df['local_time'] = [pytz.timezone('Europe/Berlin').localize(datetime.fromisoformat(t)) for t in df['local_time']]
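# note: pytz's localize() defaults to is_dst=False, so the ambiguous 2 am resolves to the second occurrence (CET)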
# so you can make a UTC index:
df.set_index(df['local_time'].dt.tz_convert('UTC'), inplace=True)
# Now you can create a new, hourly index from that and re-index:
dti = pd.date_range(df.index[0], df.index[-1], freq='H')
df2 = df.reindex(dti)
# for comparison, the "re-created" local_time column:
df2['local_time'] = df2.index.tz_convert('Europe/Berlin').strftime('%Y-%m-%d %H:%M:%S').values
that should give you something like
df2
                            value           local_time
2017-03-25 23:00:00+00:00  2016.0  2017-03-26 00:00:00
2017-03-26 00:00:00+00:00  2017.0  2017-03-26 01:00:00
2017-03-26 01:00:00+00:00  2018.0  2017-03-26 03:00:00
2017-03-26 02:00:00+00:00  2019.0  2017-03-26 04:00:00
2017-03-26 03:00:00+00:00     NaN  2017-03-26 05:00:00
...                           ...                  ...
2017-10-29 00:00:00+00:00     NaN  2017-10-29 02:00:00
2017-10-29 01:00:00+00:00  7224.0  2017-10-29 02:00:00  # note: value randomly attributed to "second" 2 am
2017-10-29 02:00:00+00:00     NaN  2017-10-29 03:00:00
2017-10-29 03:00:00+00:00  7226.0  2017-10-29 04:00:00
2017-10-29 04:00:00+00:00  7227.0  2017-10-29 05:00:00
As commented above, the value 7224 is attributed to 2017-10-29 01:00:00 UTC, but it could just as well be attributed to 2017-10-29 00:00:00 UTC. If you don't care, you're fine. If that's a problem, the best you can do in my opinion is to discard the value. You can do so by using
df['local_time'] = pd.to_datetime(df['local_time']).dt.tz_localize('Europe/Berlin', ambiguous='NaT')
instead of the native Python part in the code above.
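If you go that route, a rough sketch of the remaining steps (same column names as above assumed; dropping the NaT row before building the UTC index means the later reindex fills that hour with NaN):
df = df.dropna(subset=['local_time'])  # the undecidable 2 am row became NaT above and is dropped here
df.set_index(df['local_time'].dt.tz_convert('UTC'), inplace=True)
dti = pd.date_range(df.index[0], df.index[-1], freq='H')  # hourly UTC range, as above
df2 = df.reindex(dti)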
CodePudding user response:
Just to provide the solution which I am using for this workaround: it uses some try/except functionality in case of the AmbiguousTimeError. It converts the time vector to UTC and at the same time fills missing values by reindexing. Feel free to suggest improvements.
import numpy as np
import pandas as pd
import pytz

# df is assumed to hold the parsed (tz-naive) local times in its first column,
# freq is the desired frequency of the UTC index, e.g. 'H'
try:  # here everything is as expected: one hour is missing in March and one hour is repeated in October
    # Localize the tz-naive local times to the target time zone and convert to UTC.
    df['time'] = (df.iloc[:, 0].dt.tz_localize('CET', ambiguous='infer')
                  .dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S'))
    df = df.set_index(pd.to_datetime(df['time'], utc=True))
    # Create a complete time vector in UTC for later reindexing
    idx = pd.date_range(df.index.min(), df.index.max(), freq=freq, tz='UTC')
    # Verify that the time vector is complete
    if len(np.unique(np.diff(df.index))) == 1:
        print('Time vector is complete!')
    else:
        # print dates which are not in the sequence and add them while filling the data columns with NaN
        print(f'These dates are not in the sequence: {idx.difference(df.index)}')
        df = df.reindex(idx).rename_axis('time')
except pytz.exceptions.AmbiguousTimeError:  # here python does not know how to handle the non-repeated hour
    # create the localized datetime column with a list comprehension
    df['time'] = [pytz.timezone('Europe/Berlin').localize(t, is_dst=True) for t in df.iloc[:, 0]]
    # make a UTC index:
    df.set_index(df['time'].dt.tz_convert('UTC'), inplace=True)
    # create a new index of the desired frequency from that and re-index:
    idx = pd.date_range(df.index[0], df.index[-1], freq=freq, tz='UTC')
    # Verify that the time vector is complete
    if len(np.unique(np.diff(df.index))) == 1:
        print('Time vector is complete!')
    else:
        # print dates which are not in the sequence and add them while filling the data columns with NaN
        print(f'These were the dates which were not in the sequence: {pd.Series(idx.difference(df.index))}')
        df = df.reindex(idx).rename_axis('time')
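For example, on test_bad.csv from the question the snippet could be driven like this (file name and freq are just placeholders, not fixed values):
freq = 'H'                                         # target frequency of the UTC index
df = pd.read_csv('test_bad.csv', parse_dates=[0])  # first column parsed to tz-naive datetimes
# ... then run the try/except block from above; df ends up with a complete UTC index and NaN for the missing hours
print(df.head())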