2x2 contingency matrix (rows: Ci, columns: Cj):
2 1
1 0
Translates to:
[[ 0 0 0 1 ]
[ 0 0 1 0 ]]
The contingency matrix represents the outcome of two clustering algorithms, each with two clusters. The row sums indicate that Ci has three data points (2 + 1) in, say, cluster 1 and one data point (1 + 0) in, say, cluster 2. The column sums indicate that Cj has three data points in, say, cluster A and one data point in, say, cluster B. The diagonal entries show that both algorithms "agree" on two out of N = 4 data points.
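To make the reading of the matrix concrete, here is a small sketch that recomputes the cluster sizes and the agreement count from the 2x2 example. The variable names are illustrative only, and the agreement count assumes that cluster k of Ci is matched with cluster k of Cj.

```python
# Contingency matrix from the question: rows are Ci clusters, columns are Cj clusters.
contingency = [[2, 1],
               [1, 0]]

# Row sums: how many points Ci puts in each of its clusters.
ci_sizes = [sum(row) for row in contingency]
# Column sums: how many points Cj puts in each of its clusters.
cj_sizes = [sum(col) for col in zip(*contingency)]
# Diagonal sum: points on which both labelings coincide, assuming cluster k
# of Ci is matched with cluster k of Cj (an assumption for illustration).
agreements = sum(contingency[k][k] for k in range(len(contingency)))
n_points = sum(ci_sizes)

print(ci_sizes, cj_sizes, agreements, n_points)
```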
Since sklearn does not provide an adjusted mutual information function that takes the contingency matrix as input, I would like to transform the contingency matrix into 1D label vectors for the sklearn implementation of AMI.
Is there an efficient way to rewrite an NxN contingency matrix in 1D vector form in Python code?
It would look something like:
V1 = empty list
V2 = empty list
For each row index i:
    For each column index j:
        Append contingency_ij elements with value i to V1 and contingency_ij elements with value j to V2
CodePudding user response:
Well, this solves the problem as you have stated it. The final matrix v can be converted to numpy. v would need as many empty elements as there are dimensions in c.
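c = [[2,1],[1,0]]
# one output list per dimension of c (row labels, column labels)
v = [[],[]]
for i,row in enumerate(c):
    for j,val in enumerate(row):
        # cell (i, j) contributes val points labelled i in v[0] and j in v[1]
        v[0].extend( [i]*val )
        v[1].extend( [j]*val )
print(v)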
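If it helps, the same loop can be wrapped in a function and sanity-checked by counting the label pairs again. This is only a sketch: the name contingency_to_labels is made up for illustration, and it assumes the matrix is a list of rows holding non-negative integer counts.

```python
from collections import Counter

def contingency_to_labels(c):
    """Expand a contingency matrix (list of rows) into two label lists."""
    rows, cols = [], []
    for i, row in enumerate(c):
        for j, count in enumerate(row):
            # Cell (i, j) stands for `count` points labelled i by the first
            # clustering and j by the second one.
            rows.extend([i] * count)
            cols.extend([j] * count)
    return rows, cols

v1, v2 = contingency_to_labels([[2, 1], [1, 0]])
print(v1, v2)

# Round-trip check: counting the (i, j) pairs recovers the nonzero cells.
pair_counts = Counter(zip(v1, v2))
print(pair_counts)
```

Nothing in the loop relies on the matrix being square, so it should also work when the two clusterings have different numbers of clusters.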
CodePudding user response:
A numpy implementation could take advantage of numpy.repeat:
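import numpy as np

# input contingency matrix
a = np.array([[2,1],[1,0]])
# fixed "cluster id" matrix
b = np.array([[0,1],[0,1]])
# b.ravel('F') gives the row id of each cell, b.ravel() the column id;
# each id is repeated as many times as the matching count in a
out = np.vstack([np.repeat(b.ravel('F'), a.ravel()),
                 np.repeat(b.ravel(), a.ravel())
                 ])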
Output:
array([[0, 0, 0, 1],
[0, 0, 1, 0]])
Another example, with [[5,4],[0,3]] as input:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])
You can also use cluster ids other than 0/1, if wanted (example with a = np.array([[5,4],[0,3]]) ; b = np.array([[0,1],[2,3]])):
array([[0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 3],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3]])
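For matrices larger than 2x2, one possible variant builds the row and column ids with numpy.indices instead of a hand-written id matrix. This is a sketch rather than part of the answer above: the function name contingency_to_label_arrays is hypothetical, and it assumes the input holds non-negative integer counts.

```python
import numpy as np

def contingency_to_label_arrays(a):
    """Return (row_labels, col_labels) for a 2-D array of non-negative integer counts."""
    a = np.asarray(a)
    # Row index and column index of every cell, each with the same shape as a.
    row_ids, col_ids = np.indices(a.shape)
    counts = a.ravel()
    # Repeat each cell's row/column id as many times as its count (row-major order).
    return np.repeat(row_ids.ravel(), counts), np.repeat(col_ids.ravel(), counts)

labels_i, labels_j = contingency_to_label_arrays([[5, 4], [0, 3]])
print(labels_i)
print(labels_j)
```

The two arrays can then be passed as the labels_true and labels_pred arguments of sklearn.metrics.adjusted_mutual_info_score.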