I need to apply a function to a DataFrame, and I used pandarallel to parallelize the process. However, I have an issue: I need to pass fun_do N rows per call so that I can use vectorization inside that function. The code below calls fun_do once per row. Any idea how to make a single call per batch while keeping the parallelization?
def fun_do(value_col):
    return do(value_col)

df['processed_col'] = df.parallel_apply(lambda row: fun_do(row['col']), axis=1)
CodePudding user response:
A possible solution is to create virtual groups of N rows:
import numpy as np
import pandas as pd
from pandarallel import pandarallel

# Setup MRE
pandarallel.initialize(progress_bar=False)
df = pd.DataFrame({'col1': np.linspace(0, 100, 11)})

def fun_do(sr):
    # vectorized: receives a whole Series chunk, not a single value
    return sr**2

N = 4  # size of chunk

df['col2'] = (df.groupby(pd.RangeIndex(len(df)) // N)
                .parallel_apply(lambda x: fun_do(x['col1']))
                .droplevel(0))  # <- remove virtual group index
Output:
>>> df
col1 col2
0 0.0 0.0
1 10.0 100.0
2 20.0 400.0
3 30.0 900.0
4 40.0 1600.0
5 50.0 2500.0
6 60.0 3600.0
7 70.0 4900.0
8 80.0 6400.0
9 90.0 8100.0
10 100.0 10000.0
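
To see how the virtual groups are formed, you can inspect the grouping key on its own: each block of N consecutive rows shares a label, so fun_do receives one whole chunk per call (the last chunk may be shorter).

# Group labels produced by the integer division, for len(df) == 11 and N == 4
>>> list(pd.RangeIndex(len(df)) // N)
[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]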
Note: I don't know why groupby(...)['col'].parallel_apply(fun_do) doesn't work. It seems parallel_apply is not available on SeriesGroupBy.
This is the first time I've used pandarallel; I usually use the multiprocessing module.
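
For reference, here is a minimal sketch of the same chunked approach using the multiprocessing module instead of pandarallel. It reuses the fun_do and df from the MRE above; the chunk boundaries come from np.array_split rather than groupby.

import numpy as np
import pandas as pd
from multiprocessing import Pool

def fun_do(sr):
    # same vectorized stand-in as in the MRE above
    return sr**2

if __name__ == '__main__':
    df = pd.DataFrame({'col1': np.linspace(0, 100, 11)})
    N = 4  # size of chunk
    n_chunks = -(-len(df) // N)  # ceiling division
    # np.array_split on a Series yields sub-Series with the index preserved
    chunks = np.array_split(df['col1'], n_chunks)
    with Pool() as pool:
        results = pool.map(fun_do, chunks)  # one call per chunk, in parallel
    df['col2'] = pd.concat(results)  # reassembles and aligns on the index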