Histogramming boolean numpy arrays by giving each array a unique label-CodePudding

I have a large sample (M) of boolean arrays of length 'N'. So there are 2^N unique boolean arrays possible. I would like to know how many arrays are duplicates of each other and create a histogram.

One possibility is to create a unique integer (a[0] a[1]*2 a[3]*4 ... a[N]*2^(N-1)) for each unique array and histogram that integer.

But this is going to be O(M*N). What is the best way to do this?

CodePudding user response：

numpy.ravel_multi_index is able to do this for you:

arr = np.array([[True, True, True], 
              [True, False, True], 
              [True, False, True], 
              [False, False, False], 
              [True, False, True]], dtype=int)
nums = np.ravel_multi_index(arr.T, (2,) * arr.shape[1])
>>> nums
array([7, 5, 5, 0, 5], dtype=int64)

And since you need a histogram, use

>>> np.histogram(nums, bins=np.arange(2**arr.shape[1] 1))
(array([1, 0, 0, 0, 0, 3, 0, 1], dtype=int64), 
 array([0, 1, 2, 3, 4, 5, 6, 7, 8]))

Another option is to use np.unique:

>>> np.unique(arr, return_counts=True, axis=0)
(array([[0, 0, 0],
        [1, 0, 1],
        [1, 1, 1]]),
 array([1, 3, 1], dtype=int64))

CodePudding user response：

With vectorized operation, the creation of a key is much more faster than a[0] a[1]x2 a[3]x4 ... a[N]*2^(N-1). I think that's no a better solution... in any case you need to almost "read" one time each value, and this require MxN step.

N = 3
M = 5
sample = [
    np.array([True, True, True]),
    np.array([True, False, True]),
    np.array([True, False, True]),
    np.array([False, False, False]),
    np.array([True, False, True]),
]

multipliers = [2<<i for i in range(N-2, -1, -1)] [1]

buckets = {}
buck2vec = {}
for s in sample:
    key = sum(s*multipliers)
    if key not in buckets:
        buckets[key] = 0
        buck2vec[key] = s
    buckets[key] =1
    
for key in buckets:
    print(f"{buck2vec[key]} -> {buckets[key]} occurency")

Results:

[False False False] -> 1 occurency
[ True False  True] -> 3 occurency
[ True  True  True] -> 1 occurency