Perform numpy mean over matrix using labels as indicators

import numpy as np    
arr = np.random.random((5, 3))
labels = [1, 1, 2, 2, 3]
arr
Out[136]: 
array([[0.20349907, 0.1330621 , 0.78268978],
       [0.71883378, 0.24783927, 0.35576746],
       [0.17760916, 0.25003952, 0.29058267],
       [0.90379712, 0.78134806, 0.49941208],
       [0.08025936, 0.01712403, 0.53479622]])
labels
Out[137]: [1, 1, 2, 2, 3]

Assume I have this dataset. Using the labels as indicators, I would like to compute np.mean over the rows that share a label.

(The labels indicate the class of each row. They could also be [0, 1, 1, 0, 4, 1, 4], so make no assumptions about them.)

So the output here would be the average over:

1st and 2nd row.
3rd and 4th row.
5th row.

I want this done in the most efficient way numpy offers, like so:

[np.mean(arr[:2], axis=0),
np.mean(arr[2:4], axis=0),
np.mean(arr[4:], axis=0)]
Out[180]: 
[array([0.46116642, 0.19045069, 0.56922862]),
 array([0.54070314, 0.51569379, 0.39499737]),
 array([0.08025936, 0.01712403, 0.53479622])]

(In a real-life scenario the matrix dimensions could be (100000, 256).)

CodePudding user response：

First, sort the labels and the matrix rows by label (matrix is the array called arr in the question):

labels = np.array(labels)
# Getting the indices of a sorted array
sorted_indices = np.argsort(labels)
# Use the indices to sort both labels and matrix
sorted_labels = labels[sorted_indices]
sorted_matrix = matrix[sorted_indices]

Then we find the boundaries of each label's run in the sorted array, i.e. the (from, to) index pairs to average over. We sum the rows in each run and divide by the row count.

# Find where the label changes in sorted_labels.
# This gives the start index of every group, plus the end of the last group.
label_indices = np.concatenate(([0], np.where(np.diff(sorted_labels) != 0)[0] + 1, [len(sorted_labels)]))

# Use np.add.reduceat to sum the rows of each group, starting at each group's first index
group_sums = np.add.reduceat(sorted_matrix, label_indices[:-1], axis=0)
# getting count for each group using the diff in label_indices
group_counts = np.diff(label_indices)
# Calculating the mean
group_means = group_sums / group_counts[:, np.newaxis]
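
For reference, the steps above could be wrapped into one function. This is a minimal sketch, not part of the original answer: the function name grouped_row_means and the seeded random data are assumptions made for illustration. It compares neighbouring labels with != instead of np.diff, so that non-numeric labels would also work, and it assumes labels is non-empty.

```python
import numpy as np

def grouped_row_means(matrix, labels):
    """Mean of the rows of `matrix` for each distinct value in `labels`."""
    labels = np.asarray(labels)
    # A stable sort keeps the original row order within each label.
    order = np.argsort(labels, kind="stable")
    sorted_labels = labels[order]
    sorted_matrix = matrix[order]
    # Start index of every run of equal labels in the sorted array.
    changes = np.flatnonzero(sorted_labels[1:] != sorted_labels[:-1]) + 1
    starts = np.concatenate(([0], changes))
    # Sum each run, then divide by the run length.
    sums = np.add.reduceat(sorted_matrix, starts, axis=0)
    counts = np.diff(np.concatenate((starts, [len(sorted_labels)])))
    return sorted_labels[starts], sums / counts[:, np.newaxis]

# Illustrative data (assumed, not the values shown in the answer).
rng = np.random.default_rng(0)
matrix = rng.random((10, 5))
labels = [3, 3, 2, 3, 1, 0, 2, 0, 2, 5]

unique_labels, means = grouped_row_means(matrix, labels)
print(unique_labels)  # [0 1 2 3 5]
print(means.shape)    # (5, 5)
```

Returning the labels together with the means keeps it clear which row of the result belongs to which label.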

Example:

matrix
Out[265]: 
array([[0.69524902, 0.22105336, 0.65631557, 0.54823511, 0.25248685],
       [0.61675048, 0.45973729, 0.22410694, 0.71403135, 0.02391662],
       [0.02559926, 0.41640708, 0.27931808, 0.29139379, 0.76402121],
       [0.27166955, 0.79121862, 0.23512671, 0.32568048, 0.38712154],
       [0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
       [0.01685432, 0.8395658 , 0.73460083, 0.08056013, 0.02522956],
       [0.27274409, 0.64602305, 0.05698037, 0.23214598, 0.75130743],
       [0.65069115, 0.32383729, 0.86316629, 0.69659358, 0.26667206],
       [0.91971818, 0.02011127, 0.91776206, 0.79474582, 0.39678431],
       [0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])

labels
Out[266]: array([3, 3, 2, 3, 1, 0, 2, 0, 2, 5])

group_means 
Out[267]: 
array([[0.33377274, 0.58170155, 0.79888356, 0.38857686, 0.14595081],
       [0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
       [0.40602051, 0.36084713, 0.41802017, 0.43942853, 0.63737099],
       [0.52788969, 0.49066976, 0.37184974, 0.52931565, 0.221175  ],
       [0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])

The rows of group_means are in the same order as np.unique(sorted_labels):

np.unique(sorted_labels)
Out[271]: array([0, 1, 2, 3, 5])
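
Since the result rows follow the sorted unique labels, np.searchsorted can map a label back to its row of means. The snippet below is only a sketch: the group_means values are made-up placeholders, not the numbers shown above.

```python
import numpy as np

# Placeholder values standing in for the real results (assumed for illustration).
unique_labels = np.array([0, 1, 2, 3, 5])
group_means = np.arange(10.0).reshape(5, 2)  # one row of means per unique label

labels = np.array([3, 3, 2, 3, 1, 0, 2, 0, 2, 5])

# Look up the mean row of a single label.
row_for_label_3 = group_means[np.searchsorted(unique_labels, 3)]
print(row_for_label_3)  # [6. 7.]

# Give every original row the mean of its own group.
per_row_means = group_means[np.searchsorted(unique_labels, labels)]
print(per_row_means.shape)  # (10, 2)
```

This lookup is only valid for labels that actually occur in unique_labels.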

CodePudding user response：

I did not understand the labels part of your question, but you can calculate the mean of each row in a matrix with np.mean(arr, axis=1).

If the labels are to be used, see the script below.

import numpy as np
arr = np.array([[1,2,3],
    [4,5,6],
    [7,8,9],
    [1,2,3],
    [4,5,6]])
labels =np.array([0, 1, 1, 0, 4])
#print(arr)
#print('LABEL IS :', labels)
#print('MEAN VALUES ARE : ',np.mean(arr[:2], axis = 1))
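# Indices that sort the rows by label (renamed from id, which shadows a builtin)
order = labels.argsort()
sorted_labels = labels[order]
print(sorted_labels)
# Reorder the rows with the sort indices, not with the label values
print(arr[order])
# Note: axis=1 gives one mean per row, not one mean per label
print(np.mean(arr[order], axis=1))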
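
To get one mean per label rather than one per row, a simple sketch is to select the rows of each label with a boolean mask. This is a swapped-in approach, not the script above. It loops over the distinct labels in Python, so it is readable but likely slower than a fully vectorized version when there are many distinct labels.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [1, 2, 3],
                [4, 5, 6]])
labels = np.array([0, 1, 1, 0, 4])

# Distinct labels in sorted order.
unique_labels = np.unique(labels)
# For each label, average the rows carrying that label (axis=0 averages down the rows).
means = np.array([arr[labels == lab].mean(axis=0) for lab in unique_labels])

print(unique_labels)  # [0 1 4]
print(means)
# [[1.  2.  3. ]
#  [5.5 6.5 7.5]
#  [4.  5.  6. ]]
```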