I have a dataset that marks the occurrences of an event every minute for four years. Here's a sample:
In [547]: result
Out[547]:
uuid timestamp col1 col2 col3
0     100  2016-03-30 00:00:00+02:00   NaN   NaN    NaN
1     100  2016-03-30 00:01:00+02:00   NaN   NaN    NaN
2     100  2016-03-30 00:02:00+02:00   NaN   NaN    NaN
3     100  2016-03-30 00:03:00+02:00  1.49  1.79  0.979
4     100  2016-03-30 00:04:00+02:00   NaN   NaN    NaN
...   ...                        ...   ...   ...    ...
1435  100  2016-03-30 23:55:00+02:00   NaN   NaN    NaN
1436  100  2016-03-30 23:56:00+02:00  1.39  2.19   1.09
1437  100  2016-03-30 23:57:00+02:00   NaN   NaN    NaN
1438  100  2016-03-30 23:58:00+02:00   NaN   NaN    NaN
1439  100  2016-03-30 23:59:00+02:00   NaN   NaN    NaN
[1440 rows x 5 columns]
I am trying to compute summary statistics over the non-blank rows, aggregated into six-hour windows. The resample()
function works great for this. Here's a sample:
In [548]: result = result.set_index('timestamp').tz_convert('Europe/Berlin').resample('6h', label='right', closed='right', origin='start_day').agg(['mean', 'last', 'count']).iloc[:,-9:]
Out[548]:
col1_mean col1_last ... col3_last times_changed
timestamp ...
2016-03-30 00:00:00+02:00     NaN    NaN  ...      NaN              0
2016-03-30 07:00:00+02:00  1.0690  1.069  ...    1.279              1
2016-03-30 13:00:00+02:00  1.0365  1.009  ...    1.239              4
2016-03-30 19:00:00+02:00  1.0150  0.989  ...    1.209              5
2016-03-30 01:00:00+02:00  1.1290  1.129  ...    1.329              1
[5 rows x 7 columns]
This looks great and is the format I'd like to work with. However, when I run my code on all data (spanning many years), here's an excerpt of what the output looks like:
In [549]: result
Out[549]:
col1_mean col1_last ... col3_last times_changed
timestamp ...
2016-03-27 00:00:00+01:00     NaN    NaN  ...      NaN              0
2016-03-27 07:00:00+02:00  1.0690  1.069  ...    1.279              1
2016-03-27 13:00:00+02:00  1.0365  1.009  ...    1.239              4
2016-03-27 19:00:00+02:00  1.0150  0.989  ...    1.209              5
2016-03-28 01:00:00+02:00  1.1290  1.129  ...    1.329              1
[5 rows x 7 columns]
The new index takes DST into account and shifts everything off by an hour. I would like the bin labels to still fall on 0–6, 6–12, etc.
Is there a way to coerce my dataset to adhere to a 0–6, 6–12 format? If a day has an extra hour, the aggregations from it could simply be tucked into the 0–6 range.
The timezone I'm working with is Europe/Berlin, and I tried converting everything to UTC. However, values then land at the wrong date and time — for example, an occurrence at 00:15hrs moves to 23:15hrs on the previous day, which throws off those summary statistics.
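To make the UTC shift concrete, here is a quick illustrative check (not from my dataset) using the winter offset, where Berlin is CET, i.e. UTC+1:

```python
import pandas as pd

# An occurrence at 00:15 local winter time (CET, UTC+1)...
ts = pd.Timestamp('2016-01-15 00:15', tz='Europe/Berlin')

# ...lands at 23:15 on the PREVIOUS day once converted to UTC,
# so it would be counted in the wrong day's statistics.
print(ts.tz_convert('UTC'))  # 2016-01-14 23:15:00+00:00
```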
Are there any creative solutions to fix this?
CodePudding user response:
Have you tried this? I think it should work: first convert to the local timezone, then strip the timezone info with .tz_localize(None), so resample bins on naive local wall-clock time.
result = (
    result.set_index('timestamp')
          .tz_convert('Europe/Berlin')
          .tz_localize(None)
          .resample('6h', label='right', closed='right', origin='start_day')
          .agg(['mean', 'last', 'count'])
          .iloc[:, -9:]
)
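A minimal sketch on synthetic data (assumed: minute-frequency UTC timestamps and a single value column, spanning the 2016-03-27 spring-forward transition) to check that the bin labels stay on the 0/6/12/18 local wall-clock boundaries:

```python
import numpy as np
import pandas as pd

# Minute timestamps in UTC covering the Europe/Berlin DST change
# on 2016-03-27 (synthetic stand-in for the real dataset).
idx = pd.date_range('2016-03-26', '2016-03-29', freq='min', tz='UTC')
df = pd.DataFrame({'col1': np.nan}, index=idx)
df.iloc[::97, 0] = 1.0  # sprinkle some non-blank rows

# Convert to local wall-clock time, then drop the tz info so that
# resample bins on naive local time and ignores the offset change.
local = df.tz_convert('Europe/Berlin').tz_localize(None)
out = local.resample('6h', label='right', closed='right',
                     origin='start_day').agg(['mean', 'last', 'count'])

# Bin labels fall only on 00/06/12/18 local time on every day,
# including the 23-hour DST day.
print(sorted(set(out.index.hour)))  # [0, 6, 12, 18]
```

Note that spring-forward is the easy direction: the naive local index skips an hour but stays monotonic. The fall-back transition duplicates an hour of wall-clock time, so those rows simply get merged into the same 6-hour bin.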