Integration of pandas timeframe-CodePudding

I want to integrate the following dataframe, such that I have the integrated value for every hour. I have roughly a 10s sampling rate, but if it is necissary to have an even timeinterval, I guess I can just use df.resample().

Timestamp                    Power [W]
2022-05-05 06:00:05 02:00    2.0
2022-05-05 06:00:15 02:00    1.2
2022-05-05 06:00:25 02:00    0.3
2022-05-05 06:00:35 02:00    4.3
2022-05-05 06:00:45 02:00    1.1
                            ... 
2022-05-06 20:59:19 02:00    1.4
2022-05-06 20:59:29 02:00    2.0
2022-05-06 20:59:39 02:00    4.1
2022-05-06 20:59:49 02:00    1.3
2022-05-06 20:59:59 02:00    0.8

So I want to be able to integrate over both hours and days, so my output could look like:

Timestamp                    Energy [Wh]
2022-05-05 07:00:00 02:00    some values
2022-05-05 08:00:00 02:00    .
2022-05-05 09:00:00 02:00    .
2022-05-05 10:00:00 02:00    .
2022-05-05 11:00:00 02:00    
                            ... 
2022-05-06 20:00:00 02:00    
2022-05-06 21:00:00 02:00

(hour 07:00 is to include values between 06:00-07:00, and so on...)

and

Timestamp      Energy [Wh]
2022-05-05     .
2022-05-06     .

So how do I achieve this? I was thinking I could use scipy.integrate, but my outputs look a bit weird.

Thank you.

CodePudding user response：

You could create a new column representing your Timestamp truncated to hours:

df['Timestamp_hour'] = df['Timestamp'].dt.hour

Please note that in that case, the rows between hour 6.00 to hour 6.59 will be included into the 6 hour and not the 7 one.

Then you can group your rows by your new column before applying your integration computation:

df_integrated_hour = (
    df
    .groupby('Timestamp_hour')
    .agg({
        'Power': YOUR_INTEGRATION_FUNCTION
    })
    .rename(columns={'Power': 'Energy'})
    .reset_index()
)

Hope this will help you

CodePudding user response：

Here's a very simple solution using rectangle integration with rectangles spaced in 10 second intervals starting at zero and therefore NOT centered exactly on the data points (assuming that the data is delivered in regular intervals and no data is missing), a.k.a. a simple average.

from numpy import random
import pandas as pd

times = pd.date_range('2022-05-05 06:00:04 02:00', '2022-05-06 21:00:00 02:00', freq='10S')
watts = random.rand(len(times)) * 5
df = pd.DataFrame(index=times, data=watts, columns=["Power [W]"])

hourly = df.groupby([df.index.date, df.index.hour]).mean()
hourly.columns = ["Energy [Wh]"]
print(hourly)

hours_in_a_day = 24  # add special casing for leap days here, if required
daily = df.groupby(df.index.date).mean()
daily.columns = ["Energy [Wh]"]
print(daily)

Output:

               Energy [Wh]
2022-05-05 6      2.625499
           7      2.365678
           8      2.579349
           9      2.569170
           10     2.543611
           11     2.742332
           12     2.478145
           13     2.444210
           14     2.507821
           15     2.485770
           16     2.414057
           17     2.567755
           18     2.393725
           19     2.609375
           20     2.525746
           21     2.421578
           22     2.520466
           23     2.653466
2022-05-06 0      2.559110
           1      2.519032
           2      2.472282
           3      2.436023
           4      2.378289
           5      2.549572
           6      2.558478
           7      2.470721
           8      2.429454
           9      2.390543
           10     2.538194
           11     2.537564
           12     2.492308
           13     2.387632
           14     2.435582
           15     2.581616
           16     2.389549
           17     2.461523
           18     2.576084
           19     2.523577
           20     2.572270
            Energy [Wh]
2022-05-05    60.597007
2022-05-06    59.725029

Trapezoidal integration should give a slightly better approximation but it's harder to implement right. You'd have to deal carefully with the hour boundaries. That's basically just a matter of inserting interpolated values twice at the full hour (at 09:59:59.999 and 10:00:00). But then you'd also have to figure out a way to extrapolate to the start and end of the range, i.e. in your example go from 06:00:05 to 06:00:00. But careful, what to do if your measurements only start somewhere in the middle like 06:17:23?