I want to integrate the following dataframe, such that I have the integrated value for every hour. I have roughly a 10s sampling rate, but if it is necissary to have an even timeinterval, I guess I can just use df.resample()
.
Timestamp Power [W]
2022-05-05 06:00:05 02:00 2.0
2022-05-05 06:00:15 02:00 1.2
2022-05-05 06:00:25 02:00 0.3
2022-05-05 06:00:35 02:00 4.3
2022-05-05 06:00:45 02:00 1.1
...
2022-05-06 20:59:19 02:00 1.4
2022-05-06 20:59:29 02:00 2.0
2022-05-06 20:59:39 02:00 4.1
2022-05-06 20:59:49 02:00 1.3
2022-05-06 20:59:59 02:00 0.8
So I want to be able to integrate over both hours and days, so my output could look like:
Timestamp Energy [Wh]
2022-05-05 07:00:00 02:00 some values
2022-05-05 08:00:00 02:00 .
2022-05-05 09:00:00 02:00 .
2022-05-05 10:00:00 02:00 .
2022-05-05 11:00:00 02:00
...
2022-05-06 20:00:00 02:00
2022-05-06 21:00:00 02:00
(hour 07:00 is to include values between 06:00-07:00, and so on...)
and
Timestamp Energy [Wh]
2022-05-05 .
2022-05-06 .
So how do I achieve this? I was thinking I could use scipy.integrate
, but my outputs look a bit weird.
Thank you.
CodePudding user response:
You could create a new column representing your Timestamp truncated to hours:
df['Timestamp_hour'] = df['Timestamp'].dt.hour
Please note that in that case, the rows between hour 6.00 to hour 6.59 will be included into the 6 hour and not the 7 one.
Then you can group your rows by your new column before applying your integration computation:
df_integrated_hour = (
df
.groupby('Timestamp_hour')
.agg({
'Power': YOUR_INTEGRATION_FUNCTION
})
.rename(columns={'Power': 'Energy'})
.reset_index()
)
Hope this will help you
CodePudding user response:
Here's a very simple solution using rectangle integration with rectangles spaced in 10 second intervals starting at zero and therefore NOT centered exactly on the data points (assuming that the data is delivered in regular intervals and no data is missing), a.k.a. a simple average.
from numpy import random
import pandas as pd
times = pd.date_range('2022-05-05 06:00:04 02:00', '2022-05-06 21:00:00 02:00', freq='10S')
watts = random.rand(len(times)) * 5
df = pd.DataFrame(index=times, data=watts, columns=["Power [W]"])
hourly = df.groupby([df.index.date, df.index.hour]).mean()
hourly.columns = ["Energy [Wh]"]
print(hourly)
hours_in_a_day = 24 # add special casing for leap days here, if required
daily = df.groupby(df.index.date).mean()
daily.columns = ["Energy [Wh]"]
print(daily)
Output:
Energy [Wh]
2022-05-05 6 2.625499
7 2.365678
8 2.579349
9 2.569170
10 2.543611
11 2.742332
12 2.478145
13 2.444210
14 2.507821
15 2.485770
16 2.414057
17 2.567755
18 2.393725
19 2.609375
20 2.525746
21 2.421578
22 2.520466
23 2.653466
2022-05-06 0 2.559110
1 2.519032
2 2.472282
3 2.436023
4 2.378289
5 2.549572
6 2.558478
7 2.470721
8 2.429454
9 2.390543
10 2.538194
11 2.537564
12 2.492308
13 2.387632
14 2.435582
15 2.581616
16 2.389549
17 2.461523
18 2.576084
19 2.523577
20 2.572270
Energy [Wh]
2022-05-05 60.597007
2022-05-06 59.725029
Trapezoidal integration should give a slightly better approximation but it's harder to implement right. You'd have to deal carefully with the hour boundaries. That's basically just a matter of inserting interpolated values twice at the full hour (at 09:59:59.999 and 10:00:00). But then you'd also have to figure out a way to extrapolate to the start and end of the range, i.e. in your example go from 06:00:05 to 06:00:00. But careful, what to do if your measurements only start somewhere in the middle like 06:17:23?