Resampling in Pandas is thrown off by daylight savings time


I have a dataset that marks the occurrences of an event every minute for four years. Here's a sample:

In [547]: result
Out[547]: 
      uuid                 timestamp  col1  col2   col3
0      100 2016-03-30 00:00:00+02:00   NaN   NaN    NaN
1      100 2016-03-30 00:01:00+02:00   NaN   NaN    NaN
2      100 2016-03-30 00:02:00+02:00   NaN   NaN    NaN
3      100 2016-03-30 00:03:00+02:00  1.49  1.79  0.979
4      100 2016-03-30 00:04:00+02:00   NaN   NaN    NaN
...    ...                       ...   ...   ...    ...
1435   100 2016-03-30 23:55:00+02:00   NaN   NaN    NaN
1436   100 2016-03-30 23:56:00+02:00  1.39  2.19   1.09
1437   100 2016-03-30 23:57:00+02:00   NaN   NaN    NaN
1438   100 2016-03-30 23:58:00+02:00   NaN   NaN    NaN
1439   100 2016-03-30 23:59:00+02:00   NaN   NaN    NaN

[1440 rows x 5 columns]

I am trying to compute summary statistics over the non-blank rows, aggregated into six-hour buckets. The resample() function works great for this. Here's a sample:

In [548]: result.set_index('timestamp').tz_convert('Europe/Berlin') \
     ...:       .resample('6h', label='right', closed='right', origin='start_day') \
     ...:       .agg(['mean', 'last', 'count']).iloc[:, -9:]

Out[548]: 
                           col1_mean  col1_last  ...  col3_last  times_changed
timestamp                                        ...
2016-03-30 00:00:00+02:00        NaN        NaN  ...        NaN              0
2016-03-30 06:00:00+02:00     1.0690      1.069  ...      1.279              1
2016-03-30 12:00:00+02:00     1.0365      1.009  ...      1.239              4
2016-03-30 18:00:00+02:00     1.0150      0.989  ...      1.209              5
2016-03-31 00:00:00+02:00     1.1290      1.129  ...      1.329              1

[5 rows x 7 columns]

This looks great and is the format I'd like to work with. However, when I run my code on all data (spanning many years), here's an excerpt of what the output looks like:

In [549]: result

Out[549]: 
                           col1_mean  col1_last  ...  col3_last  times_changed
timestamp                                        ...
2016-03-27 00:00:00+01:00        NaN        NaN  ...        NaN              0
2016-03-27 07:00:00+02:00     1.0690      1.069  ...      1.279              1
2016-03-27 13:00:00+02:00     1.0365      1.009  ...      1.239              4
2016-03-27 19:00:00+02:00     1.0150      0.989  ...      1.209              5
2016-03-28 01:00:00+02:00     1.1290      1.129  ...      1.329              1

[5 rows x 7 columns]

The new index takes DST into consideration and throws everything off by an hour. I would like the new times to still be between 0–6, 6–12 etc.

Is there a way to coerce my dataset to adhere to a 0–6, 6–12 format? If there's an extra hour, maybe the aggregations from that could still be tucked into the 0–6 range?

The timezone I'm working with is Europe/Berlin, and I tried converting everything to UTC. However, the values then no longer fall on their local date and time — for example, an occurrence at 00:15hrs becomes 23:15hrs on the previous day, which throws off the summary statistics.
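(The UTC shift described above is easy to reproduce; the date below is a made-up winter example, when Berlin is at UTC+01:00:)

```python
import pandas as pd

# A winter timestamp in Europe/Berlin (UTC+01:00): converting to UTC
# moves 00:15 back to 23:15 on the previous day.
ts = pd.Timestamp("2016-01-05 00:15", tz="Europe/Berlin")
print(ts.tz_convert("UTC"))  # 2016-01-04 23:15:00+00:00
```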

Are there any creative solutions to fix this?

CodePudding user response:

Have you tried this? I think it should work: it first converts to the local timezone, then strips the timezone info with .tz_localize(None), so the 6-hour bins are computed on naive wall-clock times.

result = result.set_index('timestamp').tz_convert('Europe/Berlin') \
               .tz_localize(None) \
               .resample('6h', label='right', closed='right', origin='start_day') \
               .agg(['mean', 'last', 'count']).iloc[:, -9:]
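To see the effect, here is a minimal, self-contained sketch (with made-up single-column data instead of your uuid/col1–col3 frame) covering the 2016-03-27 spring-forward in Europe/Berlin:

```python
import numpy as np
import pandas as pd

# Minute-level index covering the 2016-03-27 DST transition in Europe/Berlin
# (clocks jump from 02:00 to 03:00, so the day has only 23 hours).
idx = pd.date_range("2016-03-27", "2016-03-28", freq="1min",
                    tz="Europe/Berlin", inclusive="left")
df = pd.DataFrame({"col1": np.arange(len(idx), dtype=float)}, index=idx)

# Resampling on the tz-aware index produces wall-clock labels that are
# shifted by an hour after the transition (07:00, 13:00, 19:00, ...).
aware = df.resample("6h", label="right", closed="right",
                    origin="start_day").count()

# Dropping the timezone first keeps clean 0-6 / 6-12 / ... buckets:
# 00:00, 06:00, 12:00, 18:00, and 00:00 of the next day.
naive = df.tz_localize(None).resample("6h", label="right", closed="right",
                                      origin="start_day").count()
print(naive.index)
```

One caveat: on the naive index, the fall-back day in October contains repeated wall-clock times, so that day's 0–6 bucket simply absorbs the extra hour — which matches the "tucked into the 0–6 range" behaviour you asked for.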