How to parallelize a for loop with iterrows across multiple cores in Python

I have a massive dataset that would benefit from multicore processing. It is a DataFrame with a sequence and a block size in each row.

I wrote a loop that extracts the sequence and block size from each row and calculates a score using a function from the localcider package.

I can't figure out how to run it in parallel.

Can somebody help?

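import pandas as pd
from localcider.sequenceParameters import SequenceParameters

omega = []
AA = list('FYW')
for i, row in df.iterrows():
    # Read the values from the row that iterrows already yields
    seq = row['IDRseq']
    b = row['bsize']
    bsize = [b - 1, b]
    SeqOb = SequenceParameters(seq, blobsize=bsize)
    omega.append(SeqOb.get_kappa_X(AA))

# One score per row, in row order
df = df.assign(omega=omega)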

CodePudding user response：

After a lot of googling, I came across pandarallel.

I think this is the most intuitive way of doing what I want.

I am posting the code for future reference.

import os
from pandarallel import pandarallel

# Use (CPU cores - 1) workers so the system stays more stable
n = max(1, (os.cpu_count() or 2) - 1)
pandarallel.initialize(progress_bar=True, nb_workers=n)

def something(x):
    # x is one row of df; do stuff with it and build result
    return result

df['result'] = df.parallel_apply(something, axis=1)
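
If installing pandarallel is not an option, the same idea can be sketched with the standard library alone. This is not the approach from the answer above, just an alternative built on concurrent.futures. The names score_row and parallel_scores are made up for this illustration, and score_row is only a stand-in (the fraction of residues in the FYW group), not the localcider kappa calculation; in real use it would build the SequenceParameters object from the question instead.

```python
import os
from concurrent.futures import ProcessPoolExecutor

AA = "FYW"  # residue group from the question


def score_row(row):
    """Hypothetical stand-in for the localcider call, NOT the real kappa score.

    Returns the fraction of residues in seq that belong to AA. bsize is
    unpacked only to mirror the shape of the question's inputs.
    """
    seq, bsize = row
    return sum(ch in AA for ch in seq) / len(seq)


def parallel_scores(rows, workers=None, chunksize=64):
    """Score every (seq, bsize) pair, spreading the rows over processes."""
    if workers is None:
        # Same rule of thumb as the answer: leave one core free.
        workers = max(1, (os.cpu_count() or 2) - 1)
    if workers == 1:
        # Serial fallback: no worker processes are started.
        return [score_row(r) for r in rows]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so they line up with rows.
        return list(pool.map(score_row, rows, chunksize=chunksize))


if __name__ == "__main__":
    # With a DataFrame: rows = list(zip(df["IDRseq"], df["bsize"]))
    rows = [("FYWAAA", 5), ("AAAA", 4), ("WWGG", 3)]
    print(parallel_scores(rows))
```

The worker function has to live at module level so it can be pickled and sent to the worker processes, and the `if __name__ == "__main__":` guard keeps those processes from re-running the script on platforms that start them by re-importing it. Because the results come back in row order, they can be assigned straight back, for example with df = df.assign(omega=parallel_scores(rows)).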