I have a very large matrix (over 100k by 100k) where each row can be calculated independently of the other rows.
I want to use multiprocessing to cut the compute time (with the matrix split into 3 slices of 1/3 of the rows each). However, multiprocessing takes longer than a single call that processes all rows. I am changing different parts of the matrix in each process - is that the issue?
import multiprocessing, os
import time, pandas as pd, numpy as np

def mat_proc(df):
    print("ID of process running worker1: {}".format(os.getpid()))
    print('done processing')
    return (df * 3)  # simplified version of process

count = 5000
df = pd.DataFrame(np.random.randint(0, 10, size=(3 * count, 3 * count)), dtype='int8')
slice1 = df.iloc[0:count, ]
slice2 = df.iloc[count:2 * count, ]
slice3 = df.iloc[2 * count:3 * count, ]

p1 = multiprocessing.Process(target=mat_proc, args=(slice1,))
p2 = multiprocessing.Process(target=mat_proc, args=(slice2,))
p3 = multiprocessing.Process(target=mat_proc, args=(slice3,))

start = time.time()
print('started now')

# this is to compare the multiprocess with a single call to the full matrix
# mat_proc(df)

if __name__ == '__main__':
    p1.start()
    p2.start()
    p3.start()
    p1.join()
    p2.join()
    p3.join()

finish = time.time()
print(f'total time taken {round(finish - start, 2)}')
CodePudding user response:
When using multiprocessing, move all script-level code into the if __name__ == '__main__' block. When each process spawns, it re-runs your main script, so every process had to recreate the dataframe, the slicing, etc.
import multiprocessing, os
import time, pandas as pd, numpy as np

def mat_proc(df):
    print("ID of process running worker1: {}".format(os.getpid()))
    print('done processing')
    return (df * 3)  # simplified version of process

if __name__ == '__main__':
    count = 5000
    df = pd.DataFrame(np.random.randint(0, 10, size=(3 * count, 3 * count)), dtype='int8')
    slice1 = df.iloc[0:count, ]
    slice2 = df.iloc[count:2 * count, ]
    slice3 = df.iloc[2 * count:3 * count, ]

    p1 = multiprocessing.Process(target=mat_proc, args=(slice1,))
    p2 = multiprocessing.Process(target=mat_proc, args=(slice2,))
    p3 = multiprocessing.Process(target=mat_proc, args=(slice3,))

    start = time.time()
    print('started now')

    # this is to compare the multiprocess with a single call to the full matrix
    # mat_proc(df)

    p1.start()
    p2.start()
    p3.start()
    p1.join()
    p2.join()
    p3.join()

    finish = time.time()
    print(f'total time taken {round(finish - start, 2)}')
And consider using multiprocessing.Pool; it is handy to be able to choose how many processes to spawn by changing a single number.
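For example, here is a minimal sketch of the same job with a Pool (the worker and slicing are taken from your script; n_procs is the single number to tune):

import multiprocessing
import numpy as np, pandas as pd

def mat_proc(df):
    return df * 3  # same simplified worker as above

if __name__ == '__main__':
    count = 5000
    df = pd.DataFrame(np.random.randint(0, 10, size=(3 * count, 3 * count)), dtype='int8')
    # one slice of rows per task
    slices = [df.iloc[i * count:(i + 1) * count] for i in range(3)]

    n_procs = 3  # the one number to change
    with multiprocessing.Pool(processes=n_procs) as pool:
        results = pool.map(mat_proc, slices)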
Second thing: if the computations are cheap (as in the simplified version of the process you provided), spawning the processes and sending the data to them (pickling and unpickling the dataframe) will take longer than the computations themselves, and multiprocessing will be slower.
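You can get a rough feel for that overhead with a sketch like the one below; multiprocessing uses pickle under the hood to ship arguments to workers, so timing pickle.dumps on one slice's worth of data approximates the transfer cost:

import pickle, time
import numpy as np, pandas as pd

count = 5000
df = pd.DataFrame(np.random.randint(0, 10, size=(count, 3 * count)), dtype='int8')

start = time.time()
payload = pickle.dumps(df)  # roughly what gets sent to a child process
print(f'pickling: {time.time() - start:.2f}s for {len(payload) / 1e6:.0f} MB')

start = time.time()
_ = df * 3  # the actual "work" from the simplified example
print(f'computation: {time.time() - start:.2f}s')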
CodePudding user response:
Spawning processes is a costly operation. If the tasks you run in the new processes are not big enough to make the spawn time negligible, you are better off sticking to one process.
Another option is multithreading, which costs less than multiprocessing. You must decide which one to use based on the scale of your data and the total processing time.
This article explains the differences and costs well. Check it out!
Also, using multiprocessing.pool.Pool and multiprocessing.pool.ThreadPool would be cleaner. Check the example below and the official docs to understand their usage.
from multiprocessing.pool import Pool, ThreadPool

def run_parallel(kls):
    # kls is either Pool or ThreadPool; both share the same map() API
    with kls() as pool:
        return pool.map(mat_proc, [df.iloc[0:count, ],
                                   df.iloc[count:2 * count, ],
                                   df.iloc[2 * count:3 * count, ]])

if __name__ == '__main__':
    run_parallel(Pool)        # run with multiprocessing
    run_parallel(ThreadPool)  # run with multithreading
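One design note on the choice between the two: ThreadPool avoids the pickling cost entirely, since threads share memory, but Python threads are limited by the GIL. In practice NumPy (and hence pandas, for numeric elementwise operations like this one) releases the GIL during many heavy computations, so threads can still give a real speedup here; benchmark both on your actual workload.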