Since for-loops are slow in Python, I need to speed up the following code.
Things I tried:
1. apply -- I haven't figured out how to use apply on a multilevel DataFrame.
2. Numba -- It seems neither Numba nor Bodo supports pandas rolling.
The code is below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(9, 3), columns=['A', 'B', 'C'])
df_result = pd.DataFrame()
shape = np.full(df.shape[1], 1)  # weight vector of ones, one entry per column

def func_cov(df):
    # rolling 3x3 covariance matrix, one block per row of df
    df_cov = df.rolling(3, min_periods=3).cov()
    for i in df.index:
        # quadratic form shape' * cov * shape for each window
        df_result.loc[i, 'result'] = np.dot(shape.T, np.dot(df_cov.loc[i], shape))
    return df_result

func_cov(df)
df:
          A         B         C
0  0.191484  0.765756 -1.288696
1 -0.111369  1.276903  1.567775
2 -0.209460  2.920247  0.142898
3  0.169375  1.096265 -0.646460
4  3.847551  0.936200 -1.221572
5 -1.783127  0.426784  1.311940
6 -0.417902  0.253048  0.097059
7 -1.176098 -0.975650  1.481306
8 -1.429595  0.257955 -0.832083
desired df_result:
     result
0       NaN
1       NaN
2  3.258732
3  1.579507
4  2.359369
5  3.684835
6  4.364114
7  0.125943
8  0.981440
Answer:
You can convert the DataFrame to a NumPy array and then do all the work using Numba and basic loops:
import numba as nb
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(9, 3), columns=['A', 'B', 'C'])
df_result = pd.DataFrame()
shape = np.full(df.shape[1], 1)

@nb.njit('(float64[:,::1], float64[:])')
def fast_func_cov(values, shape):
    result = np.empty(len(values))
    # the first two windows are incomplete, matching min_periods=3 above
    result[0] = result[1] = np.nan
    for i in range(2, len(values)):
        # covariance of the 3-row window ending at row i (columns as variables)
        cov_mat = np.cov(values[i-2:i+1, :].T)
        result[i] = np.dot(shape.T, np.dot(cov_mat, shape))
    return result
# the contiguous copy matches the float64[:,::1] signature expected by Numba
values = np.ascontiguousarray(df.values)
df_result['result'] = fast_func_cov(values, shape.astype(np.float64))
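Not part of the original answer, but as a quick sanity check you can compare the two implementations on the same df (this assumes func_cov from the question is still defined in the session); the first two rows are NaN in both and are skipped:

# Sanity check (illustrative): the Numba version should match the pandas
# rolling version from the question, up to floating-point error.
pandas_result = func_cov(df)['result'].to_numpy()
numba_result = fast_func_cov(values, shape.astype(np.float64))
assert np.allclose(pandas_result[2:], numba_result[2:])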
On my machine, the computation takes 0.016 ms, compared to 7 ms for the initial function, which is about 440 times faster. That said, the pandas column assignment takes about 0.032 ms, so the overall code is still roughly 150 times faster.
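The exact figures vary by machine; as a rough illustration (not the answer's actual benchmark code), a comparison along these lines can be timed with the standard timeit module, using the objects defined above:

import timeit

n = 10_000
t_compute = timeit.timeit(
    lambda: fast_func_cov(values, shape.astype(np.float64)), number=n)

def assign():
    # computation plus the pandas column assignment
    df_result['result'] = fast_func_cov(values, shape.astype(np.float64))

t_assign = timeit.timeit(assign, number=n)
print(f'computation only: {t_compute / n * 1e3:.4f} ms per call')
print(f'with DataFrame assignment: {t_assign / n * 1e3:.4f} ms per call')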