Is there a way to find hourly averages in pandas timeframes that do not start from even hours?

I have a pandas dataframe (Python) indexed with timestamps roughly every 10 seconds. I want to find hourly averages, but all the functions I find start their averaging at even hours (e.g. hour 9 includes data from 08:00:00 to 08:59:50). Let's say I have the dataframe below.

Timestamp              value    data   
2022-01-01 00:00:00    0.0      5.31
2022-01-01 00:00:10    0.0      0.52
2022-01-01 00:00:20    1.0      9.03
2022-01-01 00:00:30    1.0      4.37
2022-01-01 00:00:40    1.0      8.03
                      ... 
2022-01-01 13:52:30    1.0      9.75
2022-01-01 13:52:40    1.0      0.62
2022-01-01 13:52:50    1.0      3.58
2022-01-01 13:53:00    1.0      8.23
2022-01-01 13:53:10    1.0      3.07
Freq: 10S, Length: 5000, dtype: float64

So what I want to do:

Only look at stretches of data where value is consistently 1 for a full hour.
Find the average of data over each of these hours (they could e.g. be 01:30:00-02:29:50 and 11:16:30-12:16:20).

I hope I made my problem clear enough. How do I do this?

EDIT:

Maybe the question was phrased a bit unclearly. I added a third column, data, which is what I want to find the mean of. I am only interested in time intervals where value = 1 consistently through one hour; the rest of the data can be excluded.

CodePudding user response：

If I understand you correctly you want a conditional mean - calculate the mean per hour of the data column conditional on the value column being all 1 for every 10s row in that hour.

Assuming your dataframe is called df, the steps to do this are:

Create a grouping column

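This is your 'hour' column. Since the timestamps are the index of the dataframe, it can be created from the index (if Timestamp were a regular datetime column, df['Timestamp'].dt.hour would do the same). Note that the hour of day repeats across days, so this assumes the data covers a single day, as in your sample.

df['hour'] = df.index.hour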

Create condition

Now that we have a column that identifies the groups, we can check which groups are eligible - only those where value is consistently equal to 1. With rows every 10s there are 3600 / 10 = 360 rows per hour, so if we group by hour and sum the value column we should get exactly 360 when the hour is complete and every row has value 1 (assuming value is only ever 0 or 1).
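To make the 360 check concrete, here is a minimal sketch on made-up data. The toy frame and the names step, window, rows_per_hour, ones_per_hour and eligible are my own for illustration, not part of your data.

```python
import pandas as pd

# Assumed sampling interval and window length, taken from the question.
step = pd.Timedelta(seconds=10)
window = pd.Timedelta(hours=1)
rows_per_hour = window // step  # integer number of 10 s rows in one hour

# Hypothetical toy frame: two full hours, value is 1 in the first hour only.
index = pd.date_range("2022-01-01 00:00:00", periods=2 * rows_per_hour, freq=step)
df = pd.DataFrame({"value": 0.0}, index=index)
df.loc[df.index < "2022-01-01 01:00:00", "value"] = 1.0

# Sum value per hour of day and keep the hours whose sum equals the row count.
ones_per_hour = df.groupby(df.index.hour)["value"].sum()
eligible = ones_per_hour[ones_per_hour == rows_per_hour]
print(eligible)
```

Only hour 0 survives the filter here, because hour 1 contains zeros.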

Group and compute

We can now group and use the aggregate function to:

sum the value column to evaluate against our condition
compute the mean of the data column to return for the valid hours

# group and aggregate
df_mean = df[['hour', 'value', 'data']].groupby('hour').aggregate({'value': 'sum', 'data': 'mean'})
# apply condition
df_mean = df_mean[df_mean['value'] == 360]

That's it - you are left with a dataframe that contains the mean value of data for only the hours where you have a complete hour of value=1.

If you don't want the hourly groups to run 08:00:00-09:00:00 but rather, say, 08:00:10-09:00:10, the solution is simple - adjust the grouping column and leave everything else in the process unchanged.

To do this you can use datetime.timedelta (or pd.Timedelta) to shift the timestamps forward or back before extracting the hour, so that the same hour-based grouping still applies.
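A minimal sketch of that shift, assuming the timestamps are the index and that the windows should start 10 seconds past the hour. The test data and the name offset are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical test data: 3 hours at 10 s spacing, value is 1 throughout.
index = pd.date_range("2022-01-01 00:00:00", periods=3 * 360, freq="10s")
df = pd.DataFrame(
    {"value": 1.0, "data": np.arange(len(index), dtype=float)}, index=index
)

# Assumed offset: each window runs from hh:00:10 up to and including (hh+1):00:00.
offset = pd.Timedelta(seconds=10)

# Shift the timestamps back by the offset before taking the hour, so that
# 00:00:10 ... 01:00:00 all land in group 0, 01:00:10 ... 02:00:00 in group 1, etc.
df["hour"] = (df.index - offset).hour

# Same group / aggregate / filter steps as before.
df_mean = df.groupby("hour").aggregate({"value": "sum", "data": "mean"})
df_mean = df_mean[df_mean["value"] == 360]
print(df_mean)
```

The single row at 00:00:00 falls into the previous window (hour 23) and the last window is one row short, so both are dropped by the 360 check.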

Infer grouping from data

One final idea - if you would rather let the data tell you which hours are complete, wherever they start, you can do this with a rolling sum - this is even easier. Note that rolling(360) counts rows, so it assumes there are no missing timestamps. You:

compute the rolling sum of value and mean of data
only select where value is equal to 360

df_roll = df.rolling(360).aggregate({'value': 'sum', 'data': 'mean'})
df_roll = df_roll[df_roll['value'] == 360]
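Each row that survives is labelled with the last timestamp of its window, so the start of the hour has to be recovered separately. A small sketch on made-up data; the window_start column is my own addition.

```python
import numpy as np
import pandas as pd

# Hypothetical test data: 2 hours at 10 s spacing; value is 0 for the
# first 30 minutes and 1 afterwards.
index = pd.date_range("2022-01-01 00:00:00", periods=2 * 360, freq="10s")
df = pd.DataFrame({"value": 1.0, "data": np.random.random(len(index))}, index=index)
df.loc[df.index < "2022-01-01 00:30:00", "value"] = 0.0

# Rolling window of 360 rows (one hour at 10 s spacing, assuming no gaps).
df_roll = df.rolling(360).aggregate({"value": "sum", "data": "mean"})
df_roll = df_roll[df_roll["value"] == 360]

# A window of 360 rows ending at time t started 359 steps (59 min 50 s) earlier.
df_roll = df_roll.assign(
    window_start=df_roll.index - pd.Timedelta(minutes=59, seconds=50)
)
print(df_roll.head())
```

The windows overlap (one per 10s step), so you still have to pick the ones you want, for instance the first row of each uninterrupted run.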

CodePudding user response：

Yes, there is. You need resample with an offset.

Make some test data

Please make sure to provide meaningful test data next time.

import pandas as pd
import numpy as np

# One day in 10 second intervals
index = pd.date_range(start='1/1/2018', end='1/2/2018', freq='10S')
df = pd.DataFrame({"data": np.random.random(len(index))}, index=index)

# This will set the first part of the data to 1, the rest to 0
df["value"] = (df.index < "2018-01-01 10:00:10").astype(int)

This is what we got:

>>> df 
                         data  value
2018-01-01 00:00:00  0.377082      1
2018-01-01 00:00:10  0.574471      1
2018-01-01 00:00:20  0.284629      1
2018-01-01 00:00:30  0.678923      1
2018-01-01 00:00:40  0.094724      1
...                       ...    ...
2018-01-01 23:59:20  0.839973      0
2018-01-01 23:59:30  0.890321      0
2018-01-01 23:59:40  0.426595      0
2018-01-01 23:59:50  0.089174      0
2018-01-02 00:00:00  0.351624      0

Get the mean per hour with an offset

Here is a small function that checks if all value rows in the slice are equal to 1 and returns the mean if so, otherwise it (implicitly) returns None.

def get_conditioned_average(frame):
    if frame.value.eq(1).all():
        return frame.data.mean()

Now just apply this to hourly slices, starting, e.g., at 10 seconds after the full hour.

df2 = df.resample('H', offset='10S').apply(get_conditioned_average)

This is the final result:

>>> df2
2017-12-31 23:00:10    0.377082
2018-01-01 00:00:10    0.522144
2018-01-01 01:00:10    0.506536
2018-01-01 02:00:10    0.505334
2018-01-01 03:00:10    0.504431
...             ...         ...
2018-01-01 19:00:10         NaN
2018-01-01 20:00:10         NaN
2018-01-01 21:00:10         NaN
2018-01-01 22:00:10         NaN
2018-01-01 23:00:10         NaN
Freq: H, dtype: float64
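Note that the very first slice above (2017-12-31 23:00:10) contains only the single row at 00:00:00 and still passes the check, because all() is true for a partial hour too. If incomplete hours should be excluded, one option is to combine the offset resample with the row-count idea from the other answer. This is only a sketch, assuming value is always 0 or 1 and that a complete hour has 360 rows; the names hourly, ones, means and result are mine.

```python
import numpy as np
import pandas as pd

# Same test data as above: one day in 10 second intervals.
index = pd.date_range(start="2018-01-01", end="2018-01-02", freq="10s")
df = pd.DataFrame({"data": np.random.random(len(index))}, index=index)
df["value"] = (df.index < "2018-01-01 10:00:10").astype(int)

# Hourly slices starting 10 seconds after the full hour.
hourly = df.resample("h", offset="10s")

ones = hourly["value"].sum()   # number of rows with value == 1 per slice
means = hourly["data"].mean()  # mean of data per slice

# Keep only slices that are complete (360 rows) and all ones.
result = means[ones == 360]
print(result)
```

With this test data the partial first slice is dropped and only the ten slices from 00:00:10 to 09:00:10 remain.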