I have a df
and, for each row, I need to count how many consecutive columns (starting from the first) have the same sign as the first column, then multiply that count by the sign of the first column.
What I need to speed up is the calc_df
function, which runs like this on my computer:
%timeit calc_df(df)
6.38 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The output of my code is:
        a_0       a_1       a_2       a_3       a_4       a_5       a_6       a_7       a_8       a_9
0  0.097627  0.430379  0.205527  0.089766 -0.152690  0.291788 -0.124826  0.783546  0.927326 -0.233117
1  0.583450  0.057790  0.136089  0.851193 -0.857928 -0.825741 -0.959563  0.665240  0.556314  0.740024
2  0.957237  0.598317 -0.077041  0.561058 -0.763451  0.279842 -0.713293  0.889338  0.043697 -0.170676
3 -0.470889  0.548467 -0.087699  0.136868 -0.962420  0.235271  0.224191  0.233868  0.887496  0.363641
4 -0.280984 -0.125936  0.395262 -0.879549  0.333533  0.341276 -0.579235 -0.742147 -0.369143 -0.272578
0 4.0
1 4.0
2 2.0
3 -1.0
4 -2.0
In row 0, for example, the first four columns are positive and a_4 is negative, so the count is 4; the sign of a_0 is positive, so the result is 4.0.
My code is as follows; the generate_data function generates demo data with the same volume as my actual data.
import numpy as np
import pandas as pd
from numba import njit
np.random.seed(0)
pd.set_option('display.max_columns', None)
pd.set_option('expand_frame_repr', False)
# This function generates demo data.
def generate_data():
    col = [f'a_{x}' for x in range(10)]
    df = pd.DataFrame(data=np.random.uniform(-1, 1, [280000, 10]), columns=col)
    return df


@njit
def calc_numba(s):
    # s is a row of signs; count the leading run that matches s[0],
    # then multiply the count by s[0].
    a = s[0]
    b = 1
    for sign in s[1:]:
        if sign == a:
            b += 1
        else:
            break
    b *= a
    return b


def calc_series(s):
    return calc_numba(s.to_numpy())


def calc_df(df):
    df1 = np.sign(df)
    df['count'] = df1.apply(calc_series, axis=1)
    return df


def main():
    df = generate_data()
    print(df.head(5))
    df = calc_df(df)
    print(df['count'].head(5))
    return


if __name__ == '__main__':
    main()
CodePudding user response:
You can use vectorized code here.
For example, with a mask:
df1 = np.sign(df)
# m is True for the leading run of columns whose sign matches the first column;
# cummin turns everything after the first mismatch into False.
m = df1.eq(df1.iloc[:,0], axis=0).cummin(1)
# Keep only that leading run and sum the signs row-wise.
out = df1.where(m).sum(1)
Output (first 5 rows):
0 4.0
1 4.0
2 2.0
3 -1.0
4 -2.0
dtype: float64
Time to run on the whole data:
269 ms ± 37.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
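For reference, a minimal sketch of how this could be wrapped into a drop-in replacement for the question's calc_df (the name calc_df_vec is mine, not from the original code):
def calc_df_vec(df):
    # Signs of all values.
    df1 = np.sign(df)
    # Leading run of columns whose sign matches the first column.
    m = df1.eq(df1.iloc[:, 0], axis=0).cummin(axis=1)
    # Sum the signs of that run: count of matching columns times the first sign.
    df['count'] = df1.where(m).sum(axis=1)
    return df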
A faster alternative is to count the length of the leading run directly with m.sum(1) and multiply by the sign of the first column, skipping the where step:
df1 = np.sign(df)
m = df1.eq(df1.iloc[:,0], axis=0).cummin(1)
out = m.sum(1)*df1.iloc[:,0]
148 ms ± 27.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And you can probably do even better with pure numpy (you have to write a cummin
equivalent).
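A rough sketch of that pure-NumPy route, assuming np.minimum.accumulate as the cummin equivalent (not benchmarked here, so treat any extra speedup as an assumption):
import numpy as np

def calc_numpy(df):
    # Work on the raw array of signs.
    a = np.sign(df.to_numpy())
    first = a[:, [0]]
    # Row-wise cummin equivalent on the boolean mask: once a sign mismatch
    # appears, everything after it becomes False.
    run = np.minimum.accumulate(a == first, axis=1)
    # Length of the leading run, restored to the sign of the first column.
    return run.sum(axis=1) * first[:, 0]
This returns a NumPy array rather than a Series, so wrap it in pd.Series(..., index=df.index) if you need to assign it back to the DataFrame.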