An efficient way to compare values between two cells and assign a value based on a condition in NumPy

The objective is to count how often two nodes have the same value.

Say, for example, we have a vector

pd.DataFrame([0,4,1,1,1],index=['A','B','C','D','E'])

The element Nij is equal to 1 if nodes i and j have the same value, and 0 otherwise.

N is then

    A   B   C   D   E
A   1   0   0   0   0
B   0   1   0   0   0
C   0   0   1   1   1
D   0   0   1   1   1
E   0   0   1   1   1
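
To make the definition concrete, here is a minimal sketch that transcribes it literally for the 1D example. The names `labels`, `values` and `N` are only illustrative, and this is not meant as an efficient solution.

```python
import pandas as pd

labels = ['A', 'B', 'C', 'D', 'E']
values = [0, 4, 1, 1, 1]

# N[i][j] is 1 when nodes i and j hold the same value, 0 otherwise
N = [[1 if vi == vj else 0 for vj in values] for vi in values]

print(pd.DataFrame(N, index=labels, columns=labels))
```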

This simple example can be extended to 2D. For example, take an array of shape (4, 5):

   A  B  C  D  E
0  0  0  0  0  0
1  0  4  1  1  1
2  0  1  1  2  2
3  0  3  2  2  2

Similarly, we go row by row and set element Nij to 1 if nodes i and j have the same value in that row, and to 0 otherwise. The matrices obtained for the individual rows are then summed element-wise.

The frequency is then equal to

     A    B    C    D    E
A  4.0  1.0  1.0  1.0  1.0
B  1.0  4.0  2.0  1.0  1.0
C  1.0  2.0  4.0  3.0  3.0
D  1.0  1.0  3.0  4.0  4.0
E  1.0  1.0  3.0  4.0  4.0

Based on this, the following code is proposed. However, the current implementation uses three for-loops and an if-else statement.

I am curious whether the code below can be improved further, or whether there is a built-in method in pandas or NumPy that achieves the same objective.

import numpy as np

arr = [[0, 0, 0, 0, 0],
       [0, 4, 1, 1, 1],
       [0, 1, 1, 2, 2],
       [0, 3, 2, 2, 2]]
arr = np.array(arr)

# Number of rows
npart = len(arr[:, 0])

# Number of columns
m = len(arr[0, :])
X = np.zeros(shape=(m, m), dtype=np.double)
for i in range(npart):
    for k in range(m):
        for p in range(m):
            # Check whether the pair has the same value in row i
            if arr[i, k] == arr[i, p]:
                X[k, p] = X[k, p] + 1
            else:
                X[k, p] = X[k, p] + 0

Output

4.00000,1.00000,1.00000,1.00000,1.00000
1.00000,4.00000,2.00000,1.00000,1.00000
1.00000,2.00000,4.00000,3.00000,3.00000
1.00000,1.00000,3.00000,4.00000,4.00000
1.00000,1.00000,3.00000,4.00000,4.00000

P.S. The index labels A, B, C, D, E and the use of pandas are for clarification only, but suggestions using pandas are welcome.

CodePudding user response：

With numpy, you can use broadcasting:

1D

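import numpy as np
a = np.array([0, 4, 1, 1, 1])
# Compare every element with every other element; *1 turns the booleans into integers
(a == a[:, None]) * 1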

output:

array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 1, 1],
       [0, 0, 1, 1, 1],
       [0, 0, 1, 1, 1]])
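
To see why the 1D expression above yields a square matrix, it may help to look at the intermediate shapes. The following is a small sketch; the names `col`, `mask` and `N` are introduced here only for illustration.

```python
import numpy as np

a = np.array([0, 4, 1, 1, 1])

col = a[:, None]          # shape (5, 1): the same values laid out as a column
mask = (a == col)         # shapes (5,) and (5, 1) broadcast to (5, 5)
N = mask.astype(int)      # same result as multiplying the boolean mask by 1

print(col.shape, mask.shape, mask.dtype)
print(N)
```

Here `mask[i, j]` is True exactly when `a[i] == a[j]`, which matches the definition of Nij in the question.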

2D

a = np.array([[0, 0, 0, 0, 0],
              [0, 4, 1, 1, 1],
              [0, 1, 1, 2, 2],
              [0, 3, 2, 2, 2]])

# Compare every pair of columns row by row, then count the matches per pair
(a.T == a.T[:,None]).sum(2)

output:

array([[4, 1, 1, 1, 1],
       [1, 4, 2, 1, 1],
       [1, 2, 4, 3, 3],
       [1, 1, 3, 4, 4],
       [1, 1, 3, 4, 4]])
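
Since suggestions using pandas were welcome, here is a hedged sketch that wraps the 2D expression in a function and attaches the A to E labels. The function names `pair_counts` and `pair_counts_rowwise` are made up for this example. The second function is an alternative, not part of the answer above: the broadcast version builds a temporary boolean array of shape (m, m, n) for m columns and n rows, so accumulating one row at a time with `np.equal.outer` may be preferable when that temporary would be too large.

```python
import numpy as np
import pandas as pd

def pair_counts(arr):
    """For every pair of columns, count the rows in which both hold the same value."""
    cols = arr.T                                   # shape (m, n)
    # (m, 1, n) against (m, n) broadcasts to (m, m, n); summing axis 2 counts matches
    return (cols[:, None, :] == cols).sum(axis=2)

def pair_counts_rowwise(arr):
    """Same counts, accumulated one row at a time so the temporary stays (m, m)."""
    m = arr.shape[1]
    counts = np.zeros((m, m), dtype=int)
    for row in arr:
        counts += np.equal.outer(row, row)         # True where row[k] == row[p]
    return counts

a = np.array([[0, 0, 0, 0, 0],
              [0, 4, 1, 1, 1],
              [0, 1, 1, 2, 2],
              [0, 3, 2, 2, 2]])

labels = list("ABCDE")
counts = pair_counts(a)
rowwise = pair_counts_rowwise(a)

df = pd.DataFrame(counts, index=labels, columns=labels)
print(df)
print(np.array_equal(counts, rowwise))             # both approaches should agree
```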