python multiprocess very slow-CodePudding

I'm having trouble using python multiprocess.

im trying with a minimal version of code:

import os
os.environ["OMP_NUM_THREADS"] = "1" # just in case the system uses multithrad somehow
os.environ["OPENBLAS_NUM_THREADS"] = "1" # just in case the system uses multithrad somehow
os.environ["MKL_NUM_THREADS"] = "1" # just in case the system uses multithrad somehow
os.environ["VECLIB_MAXIMUM_THREADS"] = "1" # just in case the system uses multithrad somehow
os.environ["NUMEXPR_NUM_THREADS"] = "1" # just in case the system uses multithrad somehow
import numpy as np
from datetime import datetime as dt
from multiprocessing import Pool
from pandas import DataFrame as DF

def trytrytryshare(times):        
    i = 0
    for j in range(times):
        i =1
    return


def trymultishare(thread = 70 , times = 10):
    st = dt.now()
    args_l = [(times,) for i in range(thread)]               
    print(st)
    p = Pool(thread)
    for i in range(len(args_l)):
       p.apply_async(func = trytrytryshare, args = (args_l[i]))
    p.close()
    p.join()
    timecost = (dt.now()-st).total_seconds()
    print('%d threads finished in %f secs' %(thread,timecost))
    return timecost
   

if __name__ == '__main__':
    res = DF(columns = ['thread','timecost'])
    n = 0
    for j in range(5):
        for i in range(1,8,3):
            timecost = trymultishare(thread = i,times = int(1e8)) 
            res.loc[n] = [i,timecost]
            n =1
        timecost = trymultishare(thread = 70,times = int(1e8)) 
        res.loc[n] = [70,timecost]
        n =1
    res_sum = res.groupby('thread').mean()
    res_sum['decay'] =  res_sum.loc[1,'timecost'] / res_sum['timecost']

on my own computer (8cores): enter image description here

on my server (80 cores, im the only one using it) enter image description here

i tried again, make one thread job longer. enter image description here

the decay is really bad....

any idea how to "fix" this, or this is just what i can get when using multi-process?

thanks

CodePudding user response：

The way you're timing apply_async is flawed. You won't know when the subprocesses have completed unless you wait for their results.

It's a good idea to work out an optimum process pool size based on number of CPUs. The code that follows isn't necessarily the best for all cases but it's what I use.

You shouldn't set the pool size to the number of processes you intend to run. That's the whole point of using a pool.

So here's a simpler example of how you could test subprocess performance.

from multiprocessing import Pool
from time import perf_counter
from os import cpu_count

def process(n):
    r = 0
    for _ in range(n):
        r  = 1
    return r

POOL = max(cpu_count()-2, 1)
N = 1_000_000

def main(procs):
    # no need for pool size to be bigger than the numer of processes to be run
    poolsize = min(POOL, procs)
    with Pool(poolsize) as pool:
        _start = perf_counter()
        for result in [pool.apply_async(process, (N,)) for _ in range(procs)]:
            result.wait() # wait for async processes to terminate
        _end = perf_counter()
    print(f'Duration for {procs} processes with pool size of {poolsize} = {_end-_start:.2f}s')

if __name__ == '__main__':
    print(f'CPU count = {cpu_count()}')
    for procs in range(10, 101, 10):
        main(procs)

Output:

CPU count = 20
Duration for 10 processes with pool size of 10 = 0.12s
Duration for 20 processes with pool size of 18 = 0.19s
Duration for 30 processes with pool size of 18 = 0.18s
Duration for 40 processes with pool size of 18 = 0.28s
Duration for 50 processes with pool size of 18 = 0.30s
Duration for 60 processes with pool size of 18 = 0.39s
Duration for 70 processes with pool size of 18 = 0.42s
Duration for 80 processes with pool size of 18 = 0.45s
Duration for 90 processes with pool size of 18 = 0.54s
Duration for 100 processes with pool size of 18 = 0.59s

CodePudding user response：

My guess is that you're observing the cost of spawning new processes, since apply_async returns immediately. It's much cheaper to spawn one process in the case of thread==1 instead of spawning 70 processes (your last case with the worst decay).

The fact that the server with 80 cores performs better than you laptop with 8 cores could be due to the server containing better hardware in general (better heat removal, faster CPU, etc) or it might contain a different OS. Benchmarking across different machines is non-trivial.