Applying Rand index with cluster numbers and cluster labels

I have a set of reviews that I clustered with k-means, so each review now has a cluster number (e.g. 1, 2, 3, ...). I also have the true label of each review (e.g. location, food, etc.), and I need to compare the two using the Rand index.

Since I have cluster numbers on one side and cluster labels on the other, how can I apply the Rand index to compare them?

Is there any intermediate step that I should follow?

Edit: I've seen the post "Rand Index function (clustering performance evaluation)", but it does not answer my question.

In that question, you have

labels_true = [1, 1, 0, 0, 0, 0]
labels_pred = [0, 0, 0, 1, 0, 1]

but what I have looks like this:

labels_true = ['food', 'view', 'room', 'food', 'staff', 'staff']
labels_pred = [0, 0, 0, 1, 0, 1]

Any help is highly appreciated.

CodePudding user response:

Just use the sklearn.metrics.rand_score function:

from sklearn.metrics import rand_score

rand_score(labels_true, labels_pred)

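It doesn't matter that the true labels and the predicted labels take values from different domains (strings versus integers). The Rand index only checks, for every pair of items, whether the two labelings agree on putting that pair in the same group or in different groups, so no intermediate mapping step is needed. Have a look at these examples: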

>>> rand_score(['a', 'b', 'c'], [5, 6, 7])
1.0
>>> rand_score([0, 1, 2], [5, 6, 7])
1.0
>>> rand_score(['a', 'a', 'b'], [0, 1, 2])
0.6666666666666666
>>> rand_score(['a', 'a', 'b'], [7, 7, 2])
1.0
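
To see why the label values themselves never matter, here is a minimal pure-Python sketch of the pair-counting idea behind the Rand index. The helper name rand_index is hypothetical (it is not part of scikit-learn) and the code is only an illustration, not how rand_score is implemented internally. It compares labels for equality within each list only, never across the two lists.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which the two labelings agree.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0  # fewer than two items: there is nothing to disagree on

    # A pair counts as an agreement when both labelings say "same group"
    # or both say "different groups". Labels are only compared within
    # one list, so strings on one side and integers on the other are fine.
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# The lists from the question
labels_true = ['food', 'view', 'room', 'food', 'staff', 'staff']
labels_pred = [0, 0, 0, 1, 0, 1]

score = rand_index(labels_true, labels_pred)
print(score)
```

For the lists in the question, 6 of the 15 pairs agree, so the score is 0.4. Passing the same two lists directly to rand_score should give the same value.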