I have a set of reviews that I clustered with k-means, so I know which cluster each review belongs to (e.g. 1, 2, 3, ...). I also have the true label for each review (e.g. location, food, etc.), and I need to compare the two with the Rand index.
Since I have cluster numbers on one side and label names on the other, how can I apply the Rand index to compare them?
Is there any intermediate step that I should follow?
Edit: I've seen the post Rand Index function (clustering performance evaluation) but it does not answer my question.
In that question, the labels look like this:
labels_true = [1, 1, 0, 0, 0, 0]
labels_pred = [0, 0, 0, 1, 0, 1]
but what I have is something like below,
labels_true = ['food', 'view', 'room', 'food', 'staff', 'staff']
labels_pred = [0, 0, 0, 1, 0, 1]
Any help is highly appreciated.
CodePudding user response:
Just use the sklearn.metrics.rand_score function:
from sklearn.metrics import rand_score
rand_score(labels_true, labels_pred)
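It doesn't matter that the true labels and the predicted labels take values from different domains. The Rand index only checks, for each pair of samples, whether both labelings put the pair in the same group or in different groups, so the string labels can be passed as they are, with no intermediate encoding step. Have a look at these examples: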
>>> rand_score(['a', 'b', 'c'], [5, 6, 7])
1.0
>>> rand_score([0, 1, 2], [5, 6, 7])
1.0
>>> rand_score(['a', 'a', 'b'], [0, 1, 2])
0.6666666666666666
>>> rand_score(['a', 'a', 'b'], [7, 7, 2])
1.0
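To see why the label names never have to match the cluster numbers, here is a minimal pure-Python sketch of the pair-counting definition, applied to the data from the question. Note that pairwise_rand_index is a hypothetical helper written only for illustration; it is not part of scikit-learn, and returning 1.0 when there are no pairs is an assumption of this sketch.

```python
from itertools import combinations


def pairwise_rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree.

    Hypothetical illustration helper, not a scikit-learn function.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")

    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        # Only equality *within* each labeling is tested, so the two
        # labelings can use completely different value types.
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred
        total += 1

    # Assumption: with fewer than two samples there are no pairs,
    # so treat the labelings as trivially in agreement.
    return agree / total if total else 1.0


labels_true = ['food', 'view', 'room', 'food', 'staff', 'staff']
labels_pred = [0, 0, 0, 1, 0, 1]

score = pairwise_rand_index(labels_true, labels_pred)
print(score)  # 0.4 (6 of the 15 pairs agree)
```

For this data, no pair is grouped together in both labelings and 6 pairs are separated in both, so the score is 6/15. Calling rand_score on the same two lists should give the same value; in practice use that function and keep the loop above only as a way to check your understanding.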