I have a pandas dataframe (python) indexed with timestamps roughly every 10 seconds. I want to find hourly averages, but all functions I find start their averaging at even hours (e.g. hour 9 includes data from 08.00:00 to 08:59:50). Let's say I have the dataframe below.
Timestamp value data
2022-01-01 00:00:00 0.0 5.31
2022-01-01 00:00:10 0.0 0.52
2022-01-01 00:00:20 1.0 9.03
2022-01-01 00:00:30 1.0 4.37
2022-01-01 00:00:40 1.0 8.03
...
2022-01-01 13:52:30 1.0 9.75
2022-01-01 13:52:40 1.0 0.62
2022-01-01 13:52:50 1.0 3.58
2022-01-01 13:53:00 1.0 8.23
2022-01-01 13:53:10 1.0 3.07
Freq: 10S, Length: 5000, dtype: float64
So what I want to do:
- Only look at data where we have data that consistently through 1 hour has a value of 1
- Find an hourly average of these hours (could e.g. be between 01:30:00-02:29:50 and 11:16:30 - 12:16:20)..
I hope I made my problem clear enough. How do I do this?
EDIT:
Maybe the question was a bit unclear phrased.
I added a third column data
, which is what I want to find the mean of. I am only interested in time intervals where, value = 1
consistently through one hour, the rest of the data can be excluded.
CodePudding user response:
If I understand you correctly you want a conditional mean - calculate the mean per hour of the data
column conditional on the value
column being all 1
for every 10s row in that hour.
Assuming your dataframe is called df
, the steps to do this are:
Create a grouping column
This is your 'hour' column that can be created by
df['hour'] = df.Timestamp.hour
Create condition
Now we've got a column to identify groups we can check which groups are eligible - only those with value
consistently equal to 1
. If we have 10s intervals and it's per hour then if we group by hour and sum this column then we should get 360 as there are 360 10s intervals per hour.
Group and compute
We can now group and use the aggregate function to:
- sum the
value
column to evaluate against our condition - compute the mean of the
data
column to return for the valid hours
# group and aggregate
df_mean = df[['hour', 'value', 'data']].groupby('hour').aggregate({'value': 'sum', 'data': 'mean'})
# apply condition
df_mean = df_mean[df_mean['value'] == 360]
That's it - you are left with a dataframe that contains the mean value of data
for only the hours where you have a complete hour of value=1
.
If you want to augment this so you don't have to start with the grouping as per hour starting as 08:00:00-09:00:00 and maybe you want to start as 08:00:10-09:00:10 then the solution is simple - augment the grouping column but don't change anything else in the process.
To do this you can use datetime.timedelta
to shift things forward or back so that df.Timestamp.hour
can still be leveraged to keep things simple.
Infer grouping from data
One final idea - if you want to infer which hours on a rolling basis you have complete data for then you can do this with a rolling sum - this is even easier. You:
- compute the rolling sum of
value
and mean ofdata
- only select where
value
is equal to 360
df_roll = df.rolling(360).aggregate({'value': 'sum', 'data': 'mean'})
df_roll = df_roll[df_roll['value'] == 360]
CodePudding user response:
Yes, there is. You need resample
with an offset
.
Make some test data
Please make sure to provide meaningful test data next time.
import pandas as pd
import numpy as np
# One day in 10 second intervals
index = pd.date_range(start='1/1/2018', end='1/2/2018', freq='10S')
df = pd.DataFrame({"data": np.random.random(len(index))}, index=index)
# This will set the first part of the data to 1, the rest to 0
df["value"] = (df.index < "2018-01-01 10:00:10").astype(int)
This is what we got:
>>> df
data value
2018-01-01 00:00:00 0.377082 1
2018-01-01 00:00:10 0.574471 1
2018-01-01 00:00:20 0.284629 1
2018-01-01 00:00:30 0.678923 1
2018-01-01 00:00:40 0.094724 1
... ... ...
2018-01-01 23:59:20 0.839973 0
2018-01-01 23:59:30 0.890321 0
2018-01-01 23:59:40 0.426595 0
2018-01-01 23:59:50 0.089174 0
2018-01-02 00:00:00 0.351624 0
Get the mean per hour with an offset
Here is a small function that checks if all value
rows in the slice are equal to 1 and returns the mean
if so, otherwise it (implicitly) returns None
.
def get_conditioned_average(frame):
if frame.value.eq(1).all():
return frame.data.mean()
Now just apply this to hourly slices, starting, e.g., at 10 seconds after the full hour.
df2 = df.resample('H', offset='10S').apply(get_conditioned_average)
This is the final result:
>>> df2
2017-12-31 23:00:10 0.377082
2018-01-01 00:00:10 0.522144
2018-01-01 01:00:10 0.506536
2018-01-01 02:00:10 0.505334
2018-01-01 03:00:10 0.504431
... ... ...
2018-01-01 19:00:10 NaN
2018-01-01 20:00:10 NaN
2018-01-01 21:00:10 NaN
2018-01-01 22:00:10 NaN
2018-01-01 23:00:10 NaN
Freq: H, dtype: float64