Having trouble interpreting a NumPy question

Here's the question and the example given:

You are given a 2-d array A of size NxN containing floating-point numbers. The array represents pairwise correlation between N elements with A[i,j] = A[j,i] = corr(i,j) and A[i,i] = 1.

Write a Python program using NumPy to find the index of the highest correlated element for each element and finally print the sum of all these indexes.

Example: The array A = [[1, 0.3, 0.4], [0.4, 1, 0.5], [0.1, 0.6, 1]]. Then, the indexes of the highest correlated elements for each element are [3, 3, 2]. The sum of these indexes is 8.

I'm having trouble understanding the question, and the example makes my confusion worse. A contains only three arrays, each with only three values, so how can any "index of the highest correlated element" be greater than 2 if NumPy is zero-indexed?

Does anyone understand the question?

CodePudding user response:

The example is wrong in more than one way.

Correlation matrices are by definition symmetric, yet the example is not:

array([[1. , 0.3, 0.4],
       [0.4, 1. , 0.5],
       [0.1, 0.6, 1. ]])
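
One way to check this is to compare the matrix with its transpose. This is only a minimal sketch; the variable names a, is_symmetric and mismatches are my own choices, not part of the exercise:

```python
import numpy as np

# The example matrix from the exercise
a = np.array([[1.0, 0.3, 0.4],
              [0.4, 1.0, 0.5],
              [0.1, 0.6, 1.0]])

# A real correlation matrix equals its own transpose
is_symmetric = np.allclose(a, a.T)
print(is_symmetric)  # False

# Positions (i, j) where a[i, j] differs from a[j, i]
mismatches = np.argwhere(~np.isclose(a, a.T))
print(mismatches)
```

Every off-diagonal pair in the example disagrees, so all six off-diagonal positions show up in mismatches.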

Also, you are right: NumPy arrays (like everything else I know of in Python that supports indexing) are zero-indexed, so the example's answer is off by one.

The exercise wants you to find, for each random variable with index i, the index j of the random variable it is most strongly correlated with, excluding itself (the correlation coefficient of 1 on the diagonal).

Here is one way to do that given your numpy array a:

np.where(a != 1, a, 0).argmax(axis=1)

Here np.where produces an array identical to a, except that the ones are replaced with zeroes. This relies on the assumption that the correlation is always < 1 when i != j. If that does not hold, the result will be wrong.

Then argmax gives the index of the greatest value in each row. In an actual correlation matrix axis=0 would work just as well, since it would be, you know, symmetric.

The result is array([2, 2, 1]). To get the sum, you just add a .sum() at the end.
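
Put together as a runnable sketch (the variable names are mine, not from the exercise). Adding 1 to each index reproduces the numbers in the exercise's example, which suggests it was simply counting from 1:

```python
import numpy as np

a = np.array([[1.0, 0.3, 0.4],
              [0.4, 1.0, 0.5],
              [0.1, 0.6, 1.0]])

# Replace the 1s with 0 so the diagonal cannot win, then take the row-wise argmax
idx = np.where(a != 1, a, 0).argmax(axis=1)
print(idx)        # [2 2 1]
print(idx.sum())  # 5

# Shifted to 1-based indexing, as the exercise's example appears to use
print(idx + 1)          # [3 3 2]
print((idx + 1).sum())  # 8
```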

EDIT:

Now that I think about it, the assumption is too strong. Here is a better way:

b = a.copy()
np.fill_diagonal(b, -1)
b.argmax(axis=1)

Now we only assume that every row has some off-diagonal correlation greater than -1. Correlations always lie in [-1, 1], so I think that is reasonable. If you don't care about mutating the original array, you could omit the copy and fill the diagonal of a with -1 instead.
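
For comparison, here is a small sketch that wraps the fill_diagonal approach in a function and runs both versions on a matrix where two distinct variables are perfectly correlated. The function name highest_corr_indices and the matrix c are made up for illustration:

```python
import numpy as np

def highest_corr_indices(a):
    """Row-wise index of the largest off-diagonal entry (0-based)."""
    b = np.array(a, dtype=float)  # copy, so the caller's array is untouched
    np.fill_diagonal(b, -1)       # the diagonal can no longer be the maximum
    return b.argmax(axis=1)

# Hypothetical symmetric matrix where variables 0 and 1 are perfectly correlated
c = np.array([[1.0, 1.0, 0.2],
              [1.0, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

print(np.where(c != 1, c, 0).argmax(axis=1))  # [2 2 1], misses the perfect pair
print(highest_corr_indices(c))                # [1 0 1]
print(highest_corr_indices(c).sum())          # 2
```

The np.where version throws away the off-diagonal 1s together with the diagonal, while filling only the diagonal keeps them in play.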