Pandas apply function to multiple columns with sliding window-CodePudding

I need to calculate some metric using sliding window over dataframe. If metric needed just 1 column, I'd use rolling. But some how it does not work with 2 columns. Below is how I calculate the metric using regular cycle.

def mean_squared_error(aa, bb):
    return np.sum((aa - bb) ** 2) / len(aa)

def rolling_metric(df_, col_a, col_b, window, metric_fn):
    result = []
    for i, id_ in enumerate(df_.index):
        if i < (df_.shape[0] - window   1):
            slice_idx = df_.index[i: i window-1]
            slice_a, slice_b = df_.loc[slice_idx, col_a], df_.loc[slice_idx, col_b]
            result.append(metric_fn(slice_a, slice_b))
        else:
            result.append(None)
    return pd.Series(data = result, index = df_.index)

df = pd.DataFrame(data=(np.random.rand(1000, 2)*10).round(2), columns = ['y_true', 'y_pred'] )

%time df2 = rolling_metric(df, 'y_true', 'y_pred', window=7, metric_fn=mean_squared_error)

This takes close to a second for just 1000 rows.

Please suggest faster vectorized way to calculate such metric over sliding window.

CodePudding user response：

In this specific case:

You can calculate the squared error beforehand and then use .Rolling.mean():

df['sq_error'] = (df['y_true'] - df['y_pred'])**2

%time df['sq_error'].rolling(6).mean().dropna()

Please note that in your example the actual window size is 6 (print the slice length), that's why I set it to 6 in my snippet.

You can even write it like this:

%time df['y_true'].subtract(df['y_pred']).pow(2).rolling(6).mean().dropna()

In general:

In case you cannot reduce it to a single column, as of pandas 1.3.0 you can use the method='table parameter to apply the function to the entire DataFrame. This, however, has the following requirements:

This is only implemented when using the numba engine. So, you need to set engine='numba' in apply and have it installed.
You need to set raw=True in apply: this means in your function you will operate on numpy arrays instead of the DataFrame. This is a consequence of the previous point.

Therefore, your computation could be something like this:

WIN_LEN = 6

def mean_sq_err_table(arr, min_window=WIN_LEN):
    if len(arr) < min_window:
        return np.nan
    else:
        return np.mean((arr[:, 0] - arr[:, 1])**2)
    
df.rolling(WIN_LEN, method='table').apply(mean_sq_err_table, engine='numba', raw=True).dropna()

Because it uses numba, this is also relatively fast.