Pandas/NumPy: how to convert a complicated .apply function to a vectorized function

Time:02-11

I have a DataFrame, which you can reproduce by running this code:

import numpy as np
import pandas as pd
from io import StringIO
# np.prod(PofMinTab1[LowerIntegralAge1[0]-1:(LowerIntegralAge1[0])])
dfs = """
    M0  M1  M2  M3 M4  M5 M6 M7 M8 M9 M10 M11 M12  age0  age4
1   1   2   3    4  5   6  1  2 3    4  5  6   1     3    5      
2   7   5   4    5  8   3  1  2 3    4  5  6   1     4     8
3   4   8   9    3  5   2  1  2 3    4  5  6   1     6      9
"""
df = pd.read_csv(StringIO(dfs.strip()), sep=r'\s+')

And I have a .apply function:

def func(row):
    age0=row['age0']
    age4=row['age4']
    mt =[row['M0'],row['M1'],row['M2'],row['M3'], row['M4'],row['M5'],row['M6'],
         row['M7'],row['M8'],row['M9'],row['M10'],row['M11'],row['M12']]
                 
    return np.prod(mt[age0:age4])
    
df['newcol'] = df.apply(func, axis=1)

The output is:

   M0  M1  M2  M3  M4  M5  M6  M7  M8  M9  M10  M11  M12  age0  age4  newcol
1   1   2   3   4   5   6   1   2   3   4    5    6    1     3     5      20
2   7   5   4   5   8   3   1   2   3   4    5    6    1     4     8      48
3   4   8   9   3   5   2   1   2   3   4    5    6    1     6     9       6

My real data has 100,000 rows, and .apply is very slow every time I use it, so I have already converted most of my functions to vectorized versions.

My question: is there a way to convert this one to a vectorized NumPy version, or any other way to make it run much faster?

Can anyone help?

CodePudding user response:

One idea I had was to mask the original data using a condition on column indices.

# Create indices for the columns you want to compute the product over
idx = np.arange(len(df.columns) - 2)
# Create a mask of bools which correspond to the values the product 
# should be computed for
m = ((idx[None, :] >= df['age0'].to_numpy()[:,None]) 
      & (idx < df['age4'].to_numpy()[:,None]))
# Use `np.where` to apply the mask and `np.prod` to compute the row-wise product
df['result'] = np.prod(np.where(m, df.iloc[:, :-2], 1), axis=1)
df

   M0  M1  M2  M3  M4  M5  M6  M7  M8  M9  M10  M11  M12  age0  age4  result
1   1   2   3   4   5   6   1   2   3   4    5    6    1     3     5      20
2   7   5   4   5   8   3   1   2   3   4    5    6    1     4     8      48
3   4   8   9   3   5   2   1   2   3   4    5    6    1     6     9       6
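
The masking idea above can also be wrapped in a small helper that selects the M columns by name rather than by position, so it should keep working if extra columns (such as newcol from the question) have already been added to the frame. This is only a sketch: the helper name masked_row_product and its parameters are made up for illustration, and the frame is rebuilt by hand from the question's sample data.

```python
import numpy as np
import pandas as pd


def masked_row_product(df, value_cols, start_col="age0", stop_col="age4"):
    """Row-wise product of df[value_cols] over positions [start, stop).

    Hypothetical helper, not part of pandas or NumPy.
    """
    values = df[value_cols].to_numpy()
    # Column vectors of shape (n_rows, 1) so they broadcast against idx.
    start = df[start_col].to_numpy()[:, None]
    stop = df[stop_col].to_numpy()[:, None]
    # Row vector of column positions, shape (1, n_cols).
    idx = np.arange(values.shape[1])[None, :]
    # True where a position falls inside that row's [start, stop) slice.
    mask = (idx >= start) & (idx < stop)
    # Positions outside the slice are replaced by 1, the neutral element
    # of multiplication, so they do not affect the product.
    return np.where(mask, values, 1).prod(axis=1)


# Sample data copied from the question.
m_cols = [f"M{i}" for i in range(13)]
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 3, 5],
     [7, 5, 4, 5, 8, 3, 1, 2, 3, 4, 5, 6, 1, 4, 8],
     [4, 8, 9, 3, 5, 2, 1, 2, 3, 4, 5, 6, 1, 6, 9]],
    columns=m_cols + ["age0", "age4"],
    index=[1, 2, 3],
)

df["result"] = masked_row_product(df, m_cols)
print(df["result"].tolist())
```

An empty slice (age0 equal to age4) gives a product of 1 here, which matches what np.prod returns for an empty list in the original .apply function.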

CodePudding user response:

You can also try a list comprehension:

[df.loc[z][x:y].prod() for x , y, z in zip(df['age0'],df['age4'],df.index)]
Out[43]: [20, 48, 6]
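
If a loop is acceptable, one possible variant is to slice the underlying NumPy array instead of calling df.loc for every row, which avoids building a pandas Series per row. The sketch below assumes the value columns are named M0 to M12 as in the question; the variable names values and products are only illustrative, and no timing is claimed here.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
m_cols = [f"M{i}" for i in range(13)]
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 3, 5],
     [7, 5, 4, 5, 8, 3, 1, 2, 3, 4, 5, 6, 1, 4, 8],
     [4, 8, 9, 3, 5, 2, 1, 2, 3, 4, 5, 6, 1, 6, 9]],
    columns=m_cols + ["age0", "age4"],
    index=[1, 2, 3],
)

# Extract the value columns once as a 2-D array of shape (n_rows, 13).
values = df[m_cols].to_numpy()

# Slice each row of the array by position and multiply the slice.
products = [
    int(values[i, start:stop].prod())
    for i, (start, stop) in enumerate(zip(df["age0"], df["age4"]))
]
print(products)
```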