Taking the mean of each row in a NumPy matrix, grouped by the values in another NumPy matrix

I have a matrix A of size N×N with float values and a boolean matrix B of the same size.

For every row, I need the mean of the values in A at the positions where B is True.

Similarly, I need the mean of the values in A at the positions where B is False.

Finally, I need the number of rows where the "True" mean is less than the "False" mean.

For example:

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]

B = [[True, True, False],
     [False, False, True],
     [True, False, True]]

Initially, count = 0

For row 1, true_mean = (1.0 + 2.0) / 2 = 1.5 and false_mean = 3.0
true_mean < false_mean, so count = 0 + 1 = 1

For row 2, true_mean = 6.0 and false_mean = (4.0 + 5.0) / 2 = 4.5
true_mean > false_mean, so count remains the same

For row 3, true_mean = (7.0 + 9.0) / 2 = 8.0 and false_mean = 8.0
true_mean == false_mean, so count remains the same

Final count value = 1
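
The walkthrough above can be written as a plain row-by-row loop, which is handy as a reference to check faster versions against. This is only a sketch: the function name count_rows_loop is made up for illustration, and skipping rows that are all True or all False is an assumption, since the question does not say how those rows should be treated.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
B = np.array([[True, True, False],
              [False, False, True],
              [True, False, True]])

def count_rows_loop(A, B):
    """Count rows whose mean over True positions is below the mean over False positions."""
    count = 0
    for a_row, b_row in zip(A, B):
        # Assumption: a row that is all True or all False is skipped,
        # because one of the two means is undefined for it.
        if b_row.all() or not b_row.any():
            continue
        true_mean = a_row[b_row].mean()    # mean over positions where B is True
        false_mean = a_row[~b_row].mean()  # mean over positions where B is False
        if true_mean < false_mean:
            count += 1
    return count

print(count_rows_loop(A, B))  # 1
```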

My attempt:

true_mat = np.where(B, A, 0)
false_mat = np.where(B, 0, A)

true_mean = true_mat.mean(axis=1)
false_mean = false_mat.mean(axis=1)

But this gives the wrong answer, because mean() divides each row sum by N rather than by the number of True (or False) values in that row.
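
The denominator problem can be seen directly on the example matrices. The sketch below compares the mean() result with a division by the per-row True count; the variable names naive_true_mean and correct_true_mean are illustrative only.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
B = np.array([[True, True, False],
              [False, False, True],
              [True, False, True]])

true_mat = np.where(B, A, 0)

# mean(axis=1) divides every row sum by N = 3, however many entries are True
naive_true_mean = true_mat.mean(axis=1)      # roughly [1.0, 2.0, 5.33]

# Dividing by the number of True entries per row gives the intended means
true_count = B.sum(axis=1)                   # [2, 1, 2]
correct_true_mean = true_mat.sum(axis=1) / true_count  # [1.5, 6.0, 8.0]

print(naive_true_mean)
print(correct_true_mean)
```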

I only need the count; I don't need true_mean and false_mean themselves.

Is there any way to fix it?

CodePudding user response：

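The mean issue can be resolved by computing a per-row normalizer from the mask, i.e. the number of True values in each row:

mask_norm = tf.reduce_sum(tf.cast(B, true_mat.dtype), axis=1)  # True count per row
true_mean = tf.math.divide(tf.reduce_sum(true_mat, axis=1), mask_norm)
# true_mean: [1.5, 6. , 8. ]

false_mean can be computed the same way from false_mat and ~B. You can then find the count using tf.reduce_sum(tf.where(true_mean < false_mean, 1, 0))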

CodePudding user response：

You could also try something like this:

import tensorflow as tf


A = tf.constant([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 9.0]])

B = tf.constant([[True, True, False],
                [False, False, True],
                [True, False, True]])

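t_rows = tf.where(B)    # (row, col) indices of the True entries
f_rows = tf.where(~B)   # (row, col) indices of the False entries
_true = tf.gather_nd(A, t_rows)
_false = tf.gather_nd(A, f_rows)

# Per-row means, using the row index of each gathered value as its segment id
true_mean = tf.math.segment_mean(_true, t_rows[:, 0])
false_mean = tf.math.segment_mean(_false, f_rows[:, 0])

count = tf.reduce_sum(tf.cast(tf.math.greater(false_mean, true_mean), dtype=tf.int32))
tf.print(count)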

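This also works with rows that are all True or all False (tf.math.segment_mean returns 0 for an empty segment), as long as the last row contains both True and False values so that the two results have the same length: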

B = tf.constant([[True, True, True],
                [False, False, True],
                [True, False, True]])
# 0

B = tf.constant([[False, False, False],
                [False, False, False],
                [True, False, True]])
# 2
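
For comparison, here is a NumPy sketch of one way to handle rows that are all True or all False, using masked arrays. The helper name count_rows_masked is hypothetical, and leaving such rows out of the count is an assumption about the desired behavior; a convention that assigns an empty group a mean of 0 can produce a different count.

```python
import numpy as np

def count_rows_masked(A, B):
    """Count rows where the True-group mean is below the False-group mean."""
    # In a masked array, mask=True means "ignore this entry"
    true_mean = np.ma.masked_array(A, mask=~B).mean(axis=1)
    false_mean = np.ma.masked_array(A, mask=B).mean(axis=1)
    # A row with an empty group has a masked mean; filled(False) makes the
    # comparison False for it, so the row is not counted (assumed behavior).
    return int((true_mean < false_mean).filled(False).sum())

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
B = np.array([[True, True, True],
              [False, False, True],
              [True, False, True]])

print(count_rows_masked(A, B))  # 0
```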

CodePudding user response：

I would say your start is good

true_mat = np.where(B, A, 0)
false_mat = np.where(B, 0, A)

But we want to divide by the number of Trues or Falses in each row, respectively, so:

true_sum = np.sum(B, axis=1)   # number of Trues per row
false_sum = N - true_sum       # number of Falses per row; if N is not given, use N = A.shape[1]

true_mean = np.sum(true_mat, axis=1) / true_sum     # row sums of true_mat divided by the True counts
false_mean = np.sum(false_mat, axis=1) / false_sum  # row sums of false_mat divided by the False counts

For your example this gives

[1.5 6.  8. ]
[3.  4.5 8. ]

So now we just have to compare where the second is larger than the first:

count = np.sum(np.where(false_mean > true_mean, 1, 0))
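
Since only the count is needed, the division can be avoided altogether by cross-multiplying: true_sum / true_cnt < false_sum / false_cnt is equivalent to true_sum * false_cnt < false_sum * true_cnt when both counts are positive. The following is a minimal sketch of that idea, not a definitive implementation; the function name count_rows is made up for illustration. A side effect is that a row that is all True or all False makes both sides 0, so it is never counted and no division by zero occurs.

```python
import numpy as np

def count_rows(A, B):
    """Count rows where the mean over True positions is below the mean over False positions."""
    true_cnt = B.sum(axis=1)               # number of Trues per row
    false_cnt = B.shape[1] - true_cnt      # number of Falses per row
    true_sum = np.where(B, A, 0).sum(axis=1)
    false_sum = np.where(B, 0, A).sum(axis=1)
    # Compare the two means without dividing (both counts are non-negative)
    return int(np.count_nonzero(true_sum * false_cnt < false_sum * true_cnt))

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
B = np.array([[True, True, False],
              [False, False, True],
              [True, False, True]])

print(count_rows(A, B))  # 1
```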