So I have a data frame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,22], [1,23], [1,44], [2, 33], [2, 55]], columns=['id', 'delay'])
   id  delay
0   1     22
1   1     23
2   1     44
3   2     33
4   2     55
What I am doing is grouping by id and doing a rolling operation on the delay column, like below:
k = [0.1, 0.5, 1]
def f(d):
    # Pad with two zeros so every row gets a full window of 3,
    # then drop the two padding results with .iloc[2:]
    d['new_delay'] = pd.Series([0, 0]).append(d['delay']).rolling(window=3).apply(lambda x: np.sum(x * k)).iloc[2:]
    return d
df.groupby(['id']).apply(f)
   id  delay  new_delay
0   1     22       22.0
1   1     23       34.0
2   1     44       57.7
3   2     33       33.0
4   2     55       71.5
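(Each new_delay is a weighted sum of the current and two previous delays within the group, with two zeros padded at the start of each group: for example, the third row of id=1 is 0.1*22 + 0.5*23 + 1*44 = 57.7, and the first row of each group is just 1*delay because the padding contributes nothing.)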
It is working just fine, but I am curious whether .apply on a grouped data frame is vectorized or not. Since my dataset is huge, is there a better, vectorized way to do this kind of operation? I am also curious how pandas and numpy achieve vectorized calculation if Python is single-threaded and I am running on a CPU.
CodePudding user response:
You can use strides for vectorized rolling with GroupBy.transform:
k = [0.1, 0.5, 1]
def rolling_window(a, window):
    # Build a (n - window + 1, window) view onto `a` without copying data
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def f(d):
    return np.sum(rolling_window(np.append([0, 0], d.to_numpy()), 3) * k, axis=1)
df['new_delay'] = df.groupby('id')['delay'].transform(f)
print(df)
   id  delay  new_delay
0   1     22       22.0
1   1     23       34.0
2   1     44       57.7
3   2     33       33.0
4   2     55       71.5
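For intuition, here is a minimal sketch (a toy array standing in for the id=1 group, reusing the rolling_window helper above) of what the strided view looks like:

a = np.array([0, 0, 22, 23, 44])  # delays for id=1, padded with two zeros
windows = rolling_window(a, 3)
print(windows)
# [[ 0  0 22]
#  [ 0 22 23]
#  [22 23 44]]
# Each row is one length-3 window; weighting by k and summing along
# axis=1 reproduces the rolling weighted sum:
print(np.sum(windows * k, axis=1))  # [22.  34.  57.7]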
CodePudding user response:
Another option is to use np.convolve() instead:
# Our function
f = lambda x: np.convolve(np.array([1,0.5,0.1]),x)[:len(x)]
# Groupby Transform
df['new_delay'] = df.groupby('id')['delay'].transform(f)
Don't know if it's faster or not.
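For what it's worth, the weights appear reversed ([1, 0.5, 0.1] instead of k) because np.convolve flips the kernel, and the [:len(x)] slice keeps the part of the "full" convolution that is implicitly zero-padded at the start, just like the manual padding in the question. A quick sanity check on toy values:

x = np.array([22, 23, 44])
print(np.convolve(np.array([1, 0.5, 0.1]), x)[:len(x)])
# [22.  34.  57.7]
# index 0: 1*22                    = 22.0
# index 1: 1*23 + 0.5*22           = 34.0
# index 2: 1*44 + 0.5*23 + 0.1*22  = 57.7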
CodePudding user response:
Here is one approach with groupby, rolling, and apply, using a custom function compiled with numba:
def func(v):
    k = np.array([0.1, 0.5, 1])
    return np.sum(v * k[len(k) - len(v):])
(
    df.groupby('id')['delay']
    .rolling(3, min_periods=1)
    .apply(func, raw=True, engine='numba')
    .droplevel(0)
)

0    22.0
1    34.0
2    57.7
3    33.0
4    71.5
Name: delay, dtype: float64
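Because min_periods=1 allows the first windows in each group to be shorter than 3, func slices the kernel from the right so the weights stay aligned with the most recent values. Illustrative calls with toy inputs:

func(np.array([22.]))             # 1*22                    -> 22.0
func(np.array([22., 23.]))        # 0.5*22 + 1*23           -> 34.0
func(np.array([22., 23., 44.]))   # 0.1*22 + 0.5*23 + 1*44  -> 57.7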