Frequency of numpy repetition by position

I am using NumPy to draw a complete sample without replacement, according to a list of weights, and repeating this N times. In the example below, I sample from the numbers 0-3 without replacement until all four numbers have been drawn, and repeat that process 10 times. Here is what I've done so far:

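import numpy as np

np.random.seed(seed=123)

N = 10  # number of repetitions
samples = []
P = [0.5, 0.3, 0.1, 0.1]  # weights for the values 0..3

for i in range(N):
    # draw all four values without replacement, weighted by P
    picks = np.random.choice(4, size=4, replace=False, p=P)
    samples.append(picks)

samples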

It produces:

[array([1, 0, 3, 2]),
 array([3, 1, 0, 2]),
 array([1, 0, 3, 2]),
 array([0, 1, 3, 2]),
 array([1, 0, 2, 3]),
 array([0, 1, 3, 2]),
 array([1, 0, 3, 2]),
 array([0, 3, 1, 2]),
 array([2, 1, 0, 3]),
 array([0, 1, 2, 3])]

Now I'd like to determine, in code, how often each number appears in each position. For example, how many times does 0 appear in the first position? How many times does 1 appear in the first position? Ideally, I'd like the full distribution across the four positions. For instance, I can see by eye that in the third position 0 appears twice, 1 appears once, 2 appears twice, and 3 appears five times, and I'd like the same counts for every position.

CodePudding user response:

I think this works fine, although there may be more performant or better solutions:

import pandas
import numpy

a = [numpy.array([1, 0, 3, 2]),
 numpy.array([3, 1, 0, 2]),
 numpy.array([1, 0, 3, 2]),
 numpy.array([0, 1, 3, 2]),
 numpy.array([1, 0, 2, 3]),
 numpy.array([0, 1, 3, 2]),
 numpy.array([1, 0, 3, 2]),
 numpy.array([0, 3, 1, 2]),
 numpy.array([2, 1, 0, 3]),
 numpy.array([0, 1, 2, 3])]

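# one row per sample, one column per position
df = pandas.DataFrame(a)
# count each value per column (the top-level pandas.value_counts is deprecated in recent pandas)
print(df.apply(pandas.Series.value_counts))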
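
One wrinkle with the value_counts approach is that a value which never occurs in a given position shows up as NaN rather than 0, which also turns the counts in that column into floats. Below is a minimal sketch of one way to tidy that up and to get proportions as well, assuming the same ten samples as in the question. The names `counts` and `proportions` are only illustrative.

```python
import pandas as pd

# the ten samples from the question, one row per sample
samples = [
    [1, 0, 3, 2], [3, 1, 0, 2], [1, 0, 3, 2], [0, 1, 3, 2], [1, 0, 2, 3],
    [0, 1, 3, 2], [1, 0, 3, 2], [0, 3, 1, 2], [2, 1, 0, 3], [0, 1, 2, 3],
]
df = pd.DataFrame(samples)

# rows = value, columns = position; combinations that never occur become 0
counts = (
    df.apply(pd.Series.value_counts)
    .fillna(0)
    .astype(int)
    .sort_index()
)

# dividing by the number of samples gives the empirical distribution per position
proportions = counts / len(df)

print(counts)
print(proportions)
```

Here `counts.loc[3, 2]` is the number of times the value 3 appears in the third position (column index 2).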

CodePudding user response:

You can use:

# make real 2D array
arr = np.vstack(samples)

# get unique values
u = np.unique(arr)
# array([0, 1, 2, 3])

# broadcast and count
out = (arr[:,None] == u[:,None]).sum(axis=0)

output:

#  col  0  1  2  3 
array([[4, 4, 2, 0],  # value: 0
       [4, 5, 1, 0],  # value: 1
       [1, 0, 2, 7],  # value: 2
       [1, 1, 5, 3]]) # value: 3

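NB. This consumes a lot of memory on large inputs, because the broadcast comparison builds a boolean intermediate of shape (number of samples, number of unique values, number of positions) before summing.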
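
If memory is a concern, one possible alternative is to avoid the 3D intermediate by encoding each (value, position) pair as a single integer and counting with np.bincount. This is only a sketch, and it assumes the values are the non-negative integers 0..k-1, as they are in the question. The names `flat`, `n_values` and `n_pos` are illustrative.

```python
import numpy as np

# the ten samples from the question as a 2D array (rows = samples, columns = positions)
arr = np.array([
    [1, 0, 3, 2], [3, 1, 0, 2], [1, 0, 3, 2], [0, 1, 3, 2], [1, 0, 2, 3],
    [0, 1, 3, 2], [1, 0, 3, 2], [0, 3, 1, 2], [2, 1, 0, 3], [0, 1, 2, 3],
])

n_values = arr.max() + 1  # assumption: values are the integers 0..k-1
n_pos = arr.shape[1]

# map each (value, position) pair to the flat index value * n_pos + position
flat = arr * n_pos + np.arange(n_pos)

# count every flat index once, then reshape to rows = value, columns = position
out = np.bincount(flat.ravel(), minlength=n_values * n_pos).reshape(n_values, n_pos)

print(out)
```

For this data it yields the same 4x4 table as the broadcast version, while the largest temporary has the same shape as arr.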

intermediate arr:

array([[1, 0, 3, 2],
       [3, 1, 0, 2],
       [1, 0, 3, 2],
       [0, 1, 3, 2],
       [1, 0, 2, 3],
       [0, 1, 3, 2],
       [1, 0, 3, 2],
       [0, 3, 1, 2],
       [2, 1, 0, 3],
       [0, 1, 2, 3]])