Here's the question and the example given:
You are given a 2-d array A of size NxN containing floating-point numbers. The array represents pairwise correlation between N elements with A[i,j] = A[j,i] = corr(i,j) and A[i,i] = 1.
Write a Python program using NumPy to find the index of the highest correlated element for each element and finally print the sum of all these indexes.
Example: The array A = [[1, 0.3, 0.4], [0.4,1,0.5],[0.1,0.6,1]]. Then, the indexes of the highest correlated elements for each element are [3, 3, 2]. The sum of these indexes is 8.
I'm having trouble understanding the question, and the example makes my confusion worse. With each array inside A having only three values, and A itself containing only three arrays, how can any "index of the highest correlated element" be greater than 2 if numpy is zero-indexed?
Does anyone understand the question?
CodePudding user response:
To reiterate, the example is wrong in multiple ways.
Correlation matrices are by definition symmetric, yet the example is not:
array([[1. , 0.3, 0.4],
       [0.4, 1. , 0.5],
       [0.1, 0.6, 1. ]])
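The asymmetry is easy to confirm programmatically. Here is a minimal sketch, assuming the example matrix is stored in a NumPy array named a (the name is_symmetric is only for illustration):

```python
import numpy as np

# Example matrix from the question
a = np.array([[1, 0.3, 0.4],
              [0.4, 1, 0.5],
              [0.1, 0.6, 1]])

# A symmetric matrix equals its own transpose
is_symmetric = np.allclose(a, a.T)
print(is_symmetric)  # False, e.g. a[0, 1] is 0.3 but a[1, 0] is 0.4
```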
Also, you are right: numpy arrays (like everything else I know in Python that supports indexing) are zero-indexed. So the expected answer in the example is off by one, because it counts from 1.
The exercise wants you to find, for each random variable with index i, the index j of the random variable it is most strongly correlated with, excluding itself (the correlation coefficient of 1 on the diagonal).
Here is one way to do that given your numpy array a:
np.where(a != 1, a, 0).argmax(axis=1)
Here np.where produces an array identical to a except that the ones are replaced with zeroes. This is based on the assumption that if i != j, the correlation is always < 1. If that does not hold, the solution will be wrong.
Then argmax gives the index of the greatest value in each row. In an actual correlation matrix, axis=0 would work just as well, since the matrix would be symmetric.
The result is array([2, 2, 1]). To get the sum, which is 5 here, you just add a .sum() at the end.
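Put together, the approach can be sketched as follows. This assumes the example matrix from the question; the variable name idx is only for illustration, and the final line shows that shifting to 1-based indexes reproduces the 8 from the exercise.

```python
import numpy as np

# Example matrix from the question (not actually symmetric)
a = np.array([[1, 0.3, 0.4],
              [0.4, 1, 0.5],
              [0.1, 0.6, 1]])

# Replace every entry equal to 1 (here: the diagonal) with 0,
# then take the position of the largest value in each row
idx = np.where(a != 1, a, 0).argmax(axis=1)

print(idx)              # [2 2 1]
print(idx.sum())        # 5
print((idx + 1).sum())  # 8, the 1-based sum the exercise expects
```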
EDIT:
Now that I think about it, the assumption is too strong. Here is a better way:
b = a.copy()
np.fill_diagonal(b, -1)
b.argmax(axis=1)
Now we only assume that actual correlations can never be < -1, which holds by definition (the one edge case is a row whose off-diagonal entries are all exactly -1, which would tie with the diagonal). If you don't care about mutating the original array, you could omit the copy and fill the diagonal of a with -1 instead.
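To see why the diagonal fill is more robust, here is a small sketch with a matrix whose off-diagonal correlations are all negative. The helper name most_correlated and the matrix neg are made up for this illustration and are not part of the exercise.

```python
import numpy as np

def most_correlated(corr):
    """Hypothetical helper: per row, index of the largest off-diagonal entry."""
    b = np.array(corr, dtype=float)  # makes a copy, so the input is not mutated
    np.fill_diagonal(b, -1)          # push the self-correlation out of the way
    return b.argmax(axis=1)

# Made-up symmetric matrix with only negative off-diagonal correlations
neg = np.array([[ 1.0, -0.2, -0.5],
                [-0.2,  1.0, -0.3],
                [-0.5, -0.3,  1.0]])

# Zeroing the diagonal fails here: 0 beats every negative value,
# so each row picks itself
print(np.where(neg != 1, neg, 0).argmax(axis=1))  # [0 1 2]

# Filling the diagonal with -1 gives the intended neighbours
print(most_correlated(neg))                        # [1 0 1]
```

If exact -1 correlations are a concern, filling the diagonal with -np.inf on a float array would be one possible variation.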