I have two time series files which are meant to be in CET/CEST. One of them (the bad one) does not write the values properly. For the good csv, see here:
# test_good.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7224
2017-10-29 02:00,7225
2017-10-29 02:00,7226
2017-10-29 03:00,7227
...
...everything works fine by using:
df['utc_time'] = (pd.to_datetime(df[local_time_column])
                  .dt.tz_localize('CET', ambiguous="infer")
                  .dt.tz_convert('UTC')
                  .dt.strftime('%Y-%m-%d %H:%M:%S'))
When converting test_bad.csv to UTC I get an AmbiguousTimeError, as two hours are missing in October (the repeated 02:00 and the 03:00).
# test_bad.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017 # everything is as it should be
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7223
2017-10-29 02:00,7224 # the value of 2 am should actually be repeated PLUS 3 am is missing
2017-10-29 04:00,7226
2017-10-29 05:00,7227
...
Does anyone know an elegant way to still convert the time series file to UTC and add NaN values for the missing dates in the new index? Thanks for your help.
CodePudding user response:
Elaborating a bit on Mark Ransom's comment:
2017-10-29 02:00,7224
is ambiguous; it could be 2017-10-29 00:00 UTC or 2017-10-29 01:00 UTC. That's why tz_localize with ambiguous="infer" refuses to infer anything.
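To illustrate with a minimal sketch (the timestamp is taken from the question, the rest is just for demonstration): with ambiguous="infer", pandas needs the hour to appear twice in the series to decide, and since it appears only once in test_bad.csv it raises AmbiguousTimeError instead. Localizing a single Timestamp shows the two candidate UTC instants:
import pandas as pd

t = pd.Timestamp('2017-10-29 02:00')
# ambiguous=True -> treat as DST (CEST, UTC+2), i.e. the first 2 am
print(t.tz_localize('Europe/Berlin', ambiguous=True).tz_convert('UTC'))   # 2017-10-29 00:00:00+00:00
# ambiguous=False -> treat as standard time (CET, UTC+1), i.e. the second 2 am
print(t.tz_localize('Europe/Berlin', ambiguous=False).tz_convert('UTC'))  # 2017-10-29 01:00:00+00:00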
With the help of some native Python, you can work around this. Assuming you just loaded the csv to a df without parsing anything to datetime, you can continue like this:
from datetime import datetime
import pytz
df['local_time'] = [pytz.timezone('Europe/Berlin').localize(datetime.fromisoformat(t)) for t in df['local_time']]
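# note: pytz's localize() defaults to is_dst=False, so the ambiguous 2 am resolves to the second occurrence (CET)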
# so you can make a UTC index:
df.set_index(df['local_time'].dt.tz_convert('UTC'), inplace=True)
# Now you can create a new, hourly index from that and re-index:
dti = pd.date_range(df.index[0], df.index[-1], freq='H')
df2 = df.reindex(dti)
# for comparison, the "re-created" local_time column:
df2['local_time'] = df2.index.tz_convert('Europe/Berlin').strftime('%Y-%m-%d %H:%M:%S').values
that should give you something like
df2
                            value           local_time
2017-03-25 23:00:00+00:00  2016.0  2017-03-26 00:00:00
2017-03-26 00:00:00+00:00  2017.0  2017-03-26 01:00:00
2017-03-26 01:00:00+00:00  2018.0  2017-03-26 03:00:00
2017-03-26 02:00:00+00:00  2019.0  2017-03-26 04:00:00
2017-03-26 03:00:00+00:00     NaN  2017-03-26 05:00:00
...                           ...                  ...
2017-10-29 00:00:00+00:00     NaN  2017-10-29 02:00:00
2017-10-29 01:00:00+00:00  7224.0  2017-10-29 02:00:00  # note: value randomly attributed to "second" 2 am
2017-10-29 02:00:00+00:00     NaN  2017-10-29 03:00:00
2017-10-29 03:00:00+00:00  7226.0  2017-10-29 04:00:00
2017-10-29 04:00:00+00:00  7227.0  2017-10-29 05:00:00
As commented above, the value 7224 is attributed to 2017-10-29 01:00:00 UTC, but it could just as well be attributed to 2017-10-29 00:00:00 UTC. If you don't care, you're fine. If that's a problem, the best you can do in my opinion is to discard the value. You can do so by using
df['local_time'] = pd.to_datetime(df['local_time']).dt.tz_localize('Europe/Berlin', ambiguous='NaT')
instead of the native Python part in the code above.
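If you go that route, a rough sketch of the remaining steps (same column names as above assumed; dropping the NaT row before building the UTC index means the later reindex fills that hour with NaN):
df = df.dropna(subset=['local_time'])  # the undecidable 2 am row became NaT above and is dropped here
df.set_index(df['local_time'].dt.tz_convert('UTC'), inplace=True)
dti = pd.date_range(df.index[0], df.index[-1], freq='H')  # hourly UTC range, as above
df2 = df.reindex(dti)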
CodePudding user response:
Just to provide the solution which I am using for this workaround: it uses some try/except functionality in case of the AmbiguousTimeError. It converts the time vector to UTC and at the same time fills missing values by reindexing. Feel free to suggest improvements.
import numpy as np
import pandas as pd
import pytz

# df is assumed to hold the parsed (tz-naive) local times in its first column,
# freq is the desired frequency of the UTC index, e.g. 'H'
try:  # here everything is as expected: one hour is missing in March and one hour is repeated in October
    # Localize the tz-naive local times to the target time zone and convert to UTC.
    df['time'] = (df.iloc[:, 0].dt.tz_localize('CET', ambiguous='infer')
                  .dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S'))
    df = df.set_index(pd.to_datetime(df['time'], utc=True))
    # Create a complete time vector in UTC for later reindexing
    idx = pd.date_range(df.index.min(), df.index.max(), freq=freq, tz='UTC')
    # Verify that the time vector is complete
    if len(np.unique(np.diff(df.index))) == 1:
        print('Time vector is complete!')
    else:
        # print dates which are not in the sequence and add them while filling the data columns with NaN
        print(f'These dates are not in the sequence: {idx.difference(df.index)}')
        df = df.reindex(idx).rename_axis('time')
except pytz.exceptions.AmbiguousTimeError:  # here python does not know how to handle the non-repeated hour
    # create the localized datetime column with a list comprehension
    df['time'] = [pytz.timezone('Europe/Berlin').localize(t, is_dst=True) for t in df.iloc[:, 0]]
    # make a UTC index:
    df.set_index(df['time'].dt.tz_convert('UTC'), inplace=True)
    # create a new index of the desired frequency from that and re-index:
    idx = pd.date_range(df.index[0], df.index[-1], freq=freq, tz='UTC')
    # Verify that the time vector is complete
    if len(np.unique(np.diff(df.index))) == 1:
        print('Time vector is complete!')
    else:
        # print dates which are not in the sequence and add them while filling the data columns with NaN
        print(f'These were the dates which were not in the sequence: {pd.Series(idx.difference(df.index))}')
        df = df.reindex(idx).rename_axis('time')
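For example, on test_bad.csv from the question the snippet could be driven like this (file name and freq are just placeholders, not fixed values):
freq = 'H'                                         # target frequency of the UTC index
df = pd.read_csv('test_bad.csv', parse_dates=[0])  # first column parsed to tz-naive datetimes
# ... then run the try/except block from above; df ends up with a complete UTC index and NaN for the missing hours
print(df.head())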