Difference between NumPy randn and RandomState

My impression is that np.random.randn(n) produces different samples each time it is executed, while np.random.RandomState(n_clusters).randn(n) produces the same samples every time. Is this correct? Also, what does np.random.seed() do?

My code:

np.random.RandomState(2).randn(2)
Out[6]: array([-0.41675785, -0.05626683])

np.random.RandomState(4).randn(2)
Out[7]: array([0.05056171, 0.49995133])

np.random.RandomState(42).randn(2)
Out[8]: array([ 0.49671415, -0.1382643 ])

np.random.RandomState(42).randn(2)
Out[9]: array([ 0.49671415, -0.1382643 ])

np.random.RandomState(4).randn(2)
Out[10]: array([0.05056171, 0.49995133])

np.random.RandomState(2).randn(2)
Out[11]: array([-0.41675785, -0.05626683])

np.random.randn(2)
Out[12]: array([ 0.47143516, -1.19097569])

np.random.randn(2)
Out[13]: array([ 1.43270697, -0.3126519 ])

CodePudding user response：

np.random uses a pseudorandom number generator (PRNG) to generate a sequence of numbers that look random. Basically, it has an internal state, initialized from a "seed" number, to which it applies a function that produces the next number in the sequence. That function also updates the internal state, so the next number in the sequence will likely be different.
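To make the idea concrete, here is a minimal sketch of a toy generator. It is a simple linear congruential generator, not the algorithm NumPy uses, and the class name ToyPRNG and its constants are chosen purely for illustration.

```python
class ToyPRNG:
    """Toy linear congruential generator; NOT the algorithm NumPy uses."""

    def __init__(self, seed):
        self.state = seed  # the internal state starts out as the seed

    def next(self):
        # Apply a fixed function to the state, store the result as the new
        # state, and return it as the next "random" number.
        self.state = (1103515245 * self.state + 12345) % 2**31
        return self.state


a = ToyPRNG(2)
b = ToyPRNG(2)
seq_a = [a.next() for _ in range(3)]
seq_b = [b.next() for _ in range(3)]
print(seq_a == seq_b)  # True: same seed, same sequence
```

Two generators built from the same seed walk through identical states, which is the whole reason seeding gives reproducible output.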

np.random.RandomState(2) creates a new PRNG with its internal seed set to 2. This generator produces numbers from a fixed sequence, which is why every time you call np.random.RandomState(2).randn(2), you get the same 2 numbers. If you instead saved the RandomState object and repeatedly called randn(2) on it, each call would return the next 2 numbers in that sequence, and another RandomState(2) called the same way would produce exactly the same sequence.

>>> rs1 = np.random.RandomState(2)
>>> rs2 = np.random.RandomState(2)
>>> rs1.randn(2), rs2.randn(2)
(array([-0.41675785, -0.05626683]), array([-0.41675785, -0.05626683]))
>>> rs1.randn(2), rs2.randn(2)
(array([-2.1361961 ,  1.64027081]), array([-2.1361961 ,  1.64027081]))

np.random.seed(2) sets the seed of the global instance of this PRNG to 2. Normally it's seeded with something unpredictable, such as operating-system entropy or the time when the process started, so you get new random numbers every time you run a program. Setting the seed makes calls like np.random.randn(2), which use the global PRNG, produce a deterministic sequence of random numbers.

>>> np.random.seed(2)
>>> np.random.randn(2)
array([-0.41675785, -0.05626683])
>>> np.random.randn(2)
array([-2.1361961 ,  1.64027081])
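As a small sketch of what "deterministic" means here, calling np.random.seed(2) again rewinds the global generator to the start of the same sequence. The variable names below are only illustrative.

```python
import numpy as np

np.random.seed(2)
first = np.random.randn(2)
second = np.random.randn(2)  # continues the sequence, so it differs from `first`

np.random.seed(2)            # re-seeding rewinds the global generator
again = np.random.randn(2)

print(np.array_equal(first, again))   # True
print(np.array_equal(first, second))  # False
```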

CodePudding user response：

randn returns samples from a standard normal distribution using a random number generator. It uses the algorithm of the object you call it on, and the values depend on that engine's current state.
RandomState creates a pseudo-random number generator engine that uses the Mersenne Twister algorithm and initializes its state from the integer you pass.

You are correct. The np.random.RandomState(n_clusters) part of np.random.RandomState(n_clusters).randn(n) first creates a PRNG engine with the seed n_clusters. The initial state depends only on the seed, so if you create an engine with the same seed later, it will be in the same initial state and generate the same random numbers. The method randn(n) then uses the underlying Mersenne Twister algorithm to generate n random numbers from a normal distribution (internally the engine produces 32-bit integers; for n = 2, as in your code, at least four of them are consumed and turned into the two values).
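One way to check that the initial state depends only on the seed is to compare the engines' states directly. This is a minimal sketch using RandomState.get_state(), whose second element is the Mersenne Twister key array; the seeds 7 and 8 are arbitrary.

```python
import numpy as np

e1 = np.random.RandomState(7)
e2 = np.random.RandomState(7)
e3 = np.random.RandomState(8)

# get_state() returns a tuple; element 1 is the Mersenne Twister key array.
key1 = e1.get_state()[1]
key2 = e2.get_state()[1]
key3 = e3.get_state()[1]

same_seed_same_state = np.array_equal(key1, key2)
diff_seed_diff_state = not np.array_equal(key1, key3)

print(key1.shape, key1.dtype)   # 624 unsigned 32-bit integers
print(same_seed_same_state, diff_seed_diff_state)
```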

The initial state of the PRNG engine must be defined at the beginning. You can define the seed yourself, or, if you omit it, as in np.random.seed(), NumPy defines it for you from a source of randomness provided by the operating system, falling back to the clock if none is available. The function np.random.seed sets the seed of the global PRNG engine, in contrast to np.random.RandomState, where you can save the return value and use that engine again later:

import numpy as np
myprng = np.random.RandomState(2)  # keep the engine so its state persists
myprng.randn(10)                   # later calls continue the same sequence
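A saved engine is also independent of the global one. In this sketch (variable names and the seed 12345 are illustrative), seeding and drawing from the global generator between two calls does not disturb myprng, which simply continues its own sequence.

```python
import numpy as np

myprng = np.random.RandomState(2)
part1 = myprng.randn(2)

np.random.seed(12345)    # touch the global generator...
np.random.randn(100)     # ...and draw from it

part2 = myprng.randn(2)  # myprng continues where it left off

# A fresh engine with the same seed yields the same four numbers in one call.
reference = np.random.RandomState(2).randn(4)
print(np.array_equal(np.concatenate([part1, part2]), reference))  # True
```

This is why passing a RandomState object around is often preferred over relying on the global seed when several parts of a program need reproducible but separate streams.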