problem with np.random.multinomial and size option in python-CodePudding

My problem is the following: I have N people choosing between three objects [1,2,3] with probabilities [p_1,p_2,p_3] such that p_1 p_2 p_3=1. Let's call X_1,X_2,X_3 the counts of the chosen objects in one sample among the N people (then, for example, X_1 is the number of people choosing object 1 ).

The vector X_1,X_2,X_3 follows a multinomial distribution as in Wikipedia. It is well known that cov(X_1,X_2) (covariance between X_1,X_2)=-N*p_1*p_2.

I want to verify this covariance formula. I did two experiments and I got different results. I cannot understand why.

Attempt A I coded (with p_1=0.4,p_2=0.2,p_3=0.4 and N=50):

q=np.random.multinomial(50, [0.4,0.2,0.4],size=1000)
df=pd.DataFrame(q,columns=["X_1","X_2","X_3"])
cov_matrix=np.cov([df["X_1"],df["X_2"],df["X_3"]])

In my specific case, I got cov(X_1,X_2)=-4.44586486 : it is very similar to what I was expecting as -N*p_1*p_2=-50*0.4*0.2=-4

Attempt B (where I sequentially create samples of multinomial draws) I coded:

s=[1]*1000 # 1000 as the size
df["constant"]=s
df["X_1"]= df.apply(lambda x: np.random.multinomial(50, [0.4,0.2,0.4])[0],axis=1)
df["X_2"]= df.apply(lambda x: np.random.multinomial(50, [0.4,0.2,0.4])[1],axis=1)
df["X_3"]= df.apply(lambda x: np.random.multinomial(50, [0.4,0.2,0.4])[2],axis=1)
cov_matrix=np.cov([df["X_1"],df["X_2"],df["X_3"]])

In my specific case, I got cov(X_1,X_2)=-0.087452 : it is very different than what I was expecting (that is 4).

It seems to me the only difference between A and B is that in A size=1000, whereas in B I am creating a draw for each row of my dataframe.

Why do I get different results? Which mistakes I am making? There was a similar question here, but answers are not very helpful.

CodePudding user response：

In the second example, each column is generated independent of the other columns, so each row is no longer a sample from a single multinomial distribution. (Note, for example, that the sums of rows are not 50.) In effect, each column is a binomial distribution that is independent of the other columns.

The second example is equivalent to this (note that I am using the newer random API):

In [142]: rng = np.random.default_rng()

In [143]: p = rng.binomial(50, [0.4, 0.2, 0.4], size=(1000, 3))

In [144]: p
Out[144]: 
array([[15, 12, 20],
       [22,  6, 22],
       [17, 12, 17],
       ...,
       [20, 13, 17],
       [27,  9, 24],
       [25,  9, 22]])

In [145]: np.cov(p.T)
Out[145]: 
array([[13.09065465,  0.15038238, -0.22775375],
       [ 0.15038238,  7.85507107, -0.0142022 ],
       [-0.22775375, -0.0142022 , 11.87303203]])