Rolling average based on another column

Time:07-18

I have a dataframe df which looks like

time(float) value (float)
10.45 10
10.50 20
10.55 25
11.20 30
11.44 20
12.30 30

I need help calculating a new column called rolling_average_value: for each row, the average of that row's value and of all values from the hour before it. The new dataframe should look like this:

time(float) value (float) rolling_average_value
10.45 10 10
10.50 20 15
10.55 25 18.33
11.20 30 21.25
11.44 20 21
12.30 30 25

Note: This time column is a float column

CodePudding user response:

You can temporarily set a datetime index and apply rolling.mean:

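# extract hours/minutes from the float (read as HH.MM)
import numpy as np
import pandas as pd

minutes, hours = np.modf(df['time(float)'])
hours = hours.astype(int)
# round before casting: a fraction such as 0.45 may be stored as 0.4499...,
# which would truncate to 44
minutes = minutes.mul(100).round().astype(int)
# zero-pad both parts so that e.g. 9 h 5 min becomes '0905'
dt = pd.to_datetime(hours.astype(str).str.zfill(2)
                    + minutes.astype(str).str.zfill(2), format='%H%M')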

# perform rolling computation
df['rolling_mean'] = (df.set_axis(dt)
                        .rolling('1h')['value (float)']
                        .mean()
                        .set_axis(df.index)
                      )

output:

   time(float)  value (float)  rolling_mean
0        10.45             10     10.000000
1        10.50             20     15.000000
2        10.55             25     18.333333
3        11.20             30     21.250000
4        11.44             20     21.000000
5        12.30             30     25.000000
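
The question does not say whether a row exactly one hour earlier should count. As a small sketch on hypothetical toy data (not the asker's dataframe): for offset windows, pandas closes the window on the right only by default, so the window is (t - 1h, t]; passing closed="both" includes the left edge as well.

```python
import pandas as pd

# Hypothetical toy data: the first row is exactly one hour before the last.
idx = pd.to_datetime(["10:00", "10:30", "11:00"], format="%H:%M")
s = pd.Series([10.0, 20.0, 30.0], index=idx)

# Default for offset windows is closed="right": the window is (t - 1h, t].
default_mean = s.rolling("1h").mean()

# closed="both" makes it [t - 1h, t], so the 10:00 row counts at 11:00.
inclusive_mean = s.rolling("1h", closed="both").mean()

print(default_mean.iloc[-1])    # mean of 20 and 30
print(inclusive_mean.iloc[-1])  # mean of 10, 20 and 30
```

For the sample data in the question no two rows are exactly one hour apart, so both settings should give the same result there.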

Alternative to compute dt:

# format with two decimals first: str(10.5) is '10.5', which would be
# misread as 10:05 instead of 10:50
dt = pd.to_datetime(df['time(float)'].map('{:05.2f}'.format),
                    format='%H.%M')
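
For reference, here is a self-contained sketch of the whole approach on the sample data. The helper name hhmm_float_to_datetime and the dummy date 2000-01-01 are illustrative choices, not part of the answer above, and the code assumes the float column encodes HH.MM and is sorted by time.

```python
import numpy as np
import pandas as pd


def hhmm_float_to_datetime(col: pd.Series) -> pd.Series:
    """Convert HH.MM floats (10.45 -> 10:45) to datetimes on a dummy date."""
    # Work in integer HHMM units so float noise cannot truncate a minute.
    hhmm = col.mul(100).round().astype(int)
    hours, minutes = hhmm // 100, hhmm % 100
    if (minutes >= 60).any():
        raise ValueError("minutes part must be below 60")
    offset = pd.to_timedelta(hours * 60 + minutes, unit="min")
    return pd.Timestamp("2000-01-01") + offset


df = pd.DataFrame(
    {"time(float)": [10.45, 10.50, 10.55, 11.20, 11.44, 12.30],
     "value (float)": [10, 20, 25, 30, 20, 30]}
)

dt = hhmm_float_to_datetime(df["time(float)"])

# Index the values by time, roll over one hour, then drop the time index again.
values_by_time = df["value (float)"].set_axis(pd.DatetimeIndex(dt))
df["rolling_mean"] = values_by_time.rolling("1h").mean().to_numpy()

print(df)
```

The explicit minutes check is there because a value like 10.75 would otherwise be silently accepted as 10 h 75 min.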

CodePudding user response:

Assuming your data frame is sorted by time, you can also solve this with a simple list comprehension. For each row i, take the rows up to and including i whose time differs from times[i] by less than 1 (treated here as less than one hour), use their indices to slice the value column (converted to an array), and compute the mean of that slice:

import pandas as pd
import numpy as np


df = pd.DataFrame(
    {"time": [10.45, 10.5, 10.55, 11.2, 11.44, 12.3],
    "value": [10, 20, 25, 30, 20, 30]}     
)

times = df["time"].values
values = df["value"].values

df["rolling_mean"] = [
    round(np.mean(values[np.where(times[i] - times[:i + 1] < 1)[0]]), 2)
    for i in range(len(times))
]

If your data frame is large, you can JIT-compile this loop to machine code with Numba to make it significantly faster:

from numba import njit

@njit
def compute_rolling_mean(times, values):
    return [
        round(np.mean(values[np.where(times[i] - times[:i + 1] < 1)[0]]), 2)
        for i in range(len(times))
    ]
    
df["rolling_mean"] = compute_rolling_mean(df["time"].values, df["value"].values)

Output:

    time    value  rolling_mean
0   10.45   10     10.00
1   10.50   20     15.00
2   10.55   25     18.33
3   11.20   30     21.25
4   11.44   20     21.00
5   12.30   30     25.00
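
The list comprehension rescans all earlier rows for every row, so its cost grows quadratically with the number of rows. A possible alternative, which is a different technique from the one in this answer, is to combine a prefix sum with np.searchsorted. The sketch below assumes the time column encodes HH.MM and is sorted ascending, and it converts the times to minutes since midnight first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"time": [10.45, 10.5, 10.55, 11.2, 11.44, 12.3],
     "value": [10, 20, 25, 30, 20, 30]}
)

# Assumption: "time" encodes HH.MM and is sorted ascending.
hhmm = np.round(df["time"].to_numpy() * 100).astype(np.int64)
minutes = (hhmm // 100) * 60 + hhmm % 100  # minutes since midnight

values = df["value"].to_numpy(dtype=float)
# prefix[k] holds sum(values[:k]), so any window sum is one subtraction.
prefix = np.concatenate(([0.0], np.cumsum(values)))

# Window start: first index whose time is strictly later than t - 60 minutes.
start = np.searchsorted(minutes, minutes - 60, side="right")
# Exclusive window end, so the current row is included.
end = np.arange(len(values)) + 1

df["rolling_mean"] = (prefix[end] - prefix[start]) / (end - start)

print(df.round(2))
```

Like the comparison `< 1` above, this leaves out a row that lies exactly one hour back.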