I am using Pool.map() in the multiprocessing package on an embarrassingly parallel project. I want to seed the numpy random number generator once for each worker (not once per function call). My understanding from some past answers is that one should use the initializer parameter to seed each worker. However, when I pass numpy.random.seed, I get very poor seeding: all workers generate mostly the same random numbers, though not all.
There have been some changes to the way random numbers work in numpy so perhaps some of those answers are out of date. Take a look at this minimal example that illustrates the issue:
import multiprocessing
import numpy as np

def my_fun(_):
    return rng.uniform()

if __name__ == "__main__":
    rng = np.random.default_rng()
    with multiprocessing.Pool(processes=4, initializer=np.random.seed) as pool:
        my_list = pool.map(my_fun, range(40))
    print(f"Number of unique values: {len(set(my_list))}")
I would expect my_list to contain exactly 40 distinct values if seeding works, or exactly 10 if it does not. But it tends to be more like 12-15. Is there a different best practice for seeding these workers? Remember, I do not want to add any code to my_fun() because it will be called a large number of times by each worker. I just want each worker to start from a different place, so that the workers are independent. I do not require reproducibility for this project, but it would be nice if the solution provided it. Python 3.10.5 on Linux.
CodePudding user response:
You were close. Try this instead:
import multiprocessing
import numpy as np

def init():
    global rng
    rng = np.random.default_rng()

def my_fun(_):
    return rng.uniform()

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4, initializer=init) as pool:
        my_list = pool.map(my_fun, range(40))
    print(f"Number of unique values: {len(set(my_list))}")
The recommendation is that instead of seeding a shared generator, you create a new instance of the generator. Here we're creating one new, freshly seeded generator in each worker process of the pool.
For reproducible results, add code to init() to pickle each new generator or print its state:
print(rng.__getstate__())
The output is sufficient to reconstruct the generator state. It looks like this:
{'bit_generator': 'PCG64',
'state':
{'state': 319129345033546980483845008489532042435,
'inc': 198751095538548372906400916761105570237},
'has_uint32': 0,
'uinteger': 0}
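If you save that dictionary, you can later rebuild the same stream by assigning it back to a generator's bit_generator.state. A minimal sketch, assuming saved_state holds the dict that init() printed or pickled:

import numpy as np

saved_state = {'bit_generator': 'PCG64',
               'state': {'state': 319129345033546980483845008489532042435,
                         'inc': 198751095538548372906400916761105570237},
               'has_uint32': 0,
               'uinteger': 0}

rng = np.random.default_rng()          # fresh generator with an arbitrary seed
rng.bit_generator.state = saved_state  # overwrite it with the recorded state
print(rng.uniform())                   # replays the stream that worker produced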
CodePudding user response:
I'm not sure I have a better answer than Raymond's, but I was evidently unfamiliar with the rather significant changes to numpy's random module, so I did some investigation to understand it further.
From the documentation, numpy now discourages using the legacy module-level generator (the global mtrand.RandomState instance behind functions like np.random.seed) in favor of explicitly creating your own Generator instance via default_rng(). This encourages programmers to be deliberate about randomness, particularly for security (or repeatability). In your code you mix the old and new styles of the numpy.random module, and that is the actual error for which Raymond posted the solution: np.random.seed does not seed rng, it seeds the legacy global np.random.mtrand.RandomState. Each child process then uses the same rng (copied via fork), which is not affected by np.random.seed.
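You can see that split directly in an interpreter. This little snippet (mine, not from the question) seeds the legacy global stream and shows that a separately created Generator ignores it:

import numpy as np

rng = np.random.default_rng()

np.random.seed(0)
print(np.random.uniform())  # legacy global stream: same value every run after seed(0)
print(rng.uniform())        # Generator stream: unaffected by np.random.seed

np.random.seed(0)
print(np.random.uniform())  # repeats the legacy value above
print(rng.uniform())        # keeps drawing from its own independent stream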
On a side note, I would also point out that relying on the child process inheriting rng so that it is available in my_fun only works with the fork start method; it won't work on Windows or macOS, where spawn is used. Raymond's solution also solves this problem, as rng is explicitly created in the child after fork/spawn.
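If you want to check that on Linux, you can force the spawn start method yourself. A quick sketch, reusing init() and my_fun() exactly as defined in Raymond's answer:

import multiprocessing
import numpy as np

def init():
    global rng
    rng = np.random.default_rng()

def my_fun(_):
    return rng.uniform()

if __name__ == "__main__":
    # get_context("spawn") mimics the default start method on Windows/macOS;
    # each worker still builds its own generator in init().
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=4, initializer=init) as pool:
        my_list = pool.map(my_fun, range(40))
    print(f"Number of unique values: {len(set(my_list))}")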
Finally, you say you have an "embarrassingly parallel" scenario, and according to the docs there may be cases where you want a different BitGenerator than the default PCG64 (the math is a bit over my head, though):
import secrets  # needed for randbits below

rng = np.random.Generator(np.random.PCG64DXSM())
# or, if you want to know for sure you're using the best available seed source:
rng = np.random.Generator(np.random.PCG64DXSM(secrets.randbits(128)))
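And since you mentioned reproducibility would be nice to have: the numpy docs on parallel random number generation also describe spawning independent child seeds from a single SeedSequence. This is not from either answer above, just a sketch of that pattern; the root seed 12345 is arbitrary, and a queue is used because initargs would pass the same value to every worker:

import multiprocessing
import numpy as np

def init(seed_queue):
    # Each worker takes one child seed and builds its own independent generator from it.
    global rng
    rng = np.random.default_rng(seed_queue.get())

def my_fun(_):
    return rng.uniform()

if __name__ == "__main__":
    n_workers = 4
    # A fixed root seed makes runs repeatable; spawn() yields independent child seeds.
    child_seeds = np.random.SeedSequence(12345).spawn(n_workers)
    seed_queue = multiprocessing.Queue()
    for s in child_seeds:
        seed_queue.put(s)
    with multiprocessing.Pool(processes=n_workers, initializer=init,
                              initargs=(seed_queue,)) as pool:
        my_list = pool.map(my_fun, range(40))
    print(f"Number of unique values: {len(set(my_list))}")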