How to convert the cluster id in prediction result to class label in k-means clustering prediction m-CodePudding

I am playing with this dataset (credit card fraud) and trying to train a prediction model with k-means clustering. But the pre_model results are labeled by cluster id. Therefore, when I try to evaluate the performance of the model with confusionMatrix, it raises an error saying that the test data and the predicted result do not have the same levels. So, how can I convert the labels in the predicted result to match testdata$Class, which is 1 or 0 (fraud or not), instead of the cluster id? Thanks!

Code:

data = read.csv("creditcard.csv")
data$Class <- as.factor(data$Class)
data_split <- createDataPartition(data$Class, times=1, p=0.8, list=F)
train_data <- data[data_split ,]
testdata <- data[-data_split ,]
scaled_data <- scale(train_data[-c(31)])
scaled_data <- as.matrix(scaled_data)
clust <- kmeans(scaled_data, centers = 9, nstart = 25)

pre_model <- cl_predict(clust , testdata)
confusionMatrix(testdata$Class,  pre_model , positive='1')

K-means Cluster Result:

> Clustering vector:
   [1] 7 7 6 6 7 7 8 8 3 8 6 3 6 7 7 8 7 8 7 6 6 7 7 7 3 7 7 6 7 7 7 7 7 7 7 7 7 7 7 7 3 7 7 7 8 7
  [47] 8 7 6 7 7 1 7 7 8 7 7 3 7 7 8 6 7 3 7 7 8 7 7 7 7 3 7 7 7 7 7 7 7 3 7 7 9 6 7 7 7 7 3 1 7 7
  [93] 7 7 3 7 7 7 7 7 7 7 7 7 8 3 7 7 7 8 7 7 7 7 7 7 7 7 7 7 8 7 6 8 8 7 8 8 8 7 8 7 7 7 7 7 6 8
 [139] 7 8 8 6 7 3 8 3 9 7 7 7 8 7 7 8 7 3 7 7 7 7 6 7 7 7 1 7 7 7 7 7 7 6 7 7 8 7 7 7 8 6 7 8 7 8....

Error in confusionMatrix:

Error: data and reference should be factors with the same levels.

CodePudding user response：

Your analysis is complicated by the fact that cluster analysis is an unsupervised approach. That is, it does not try to predict the original class assignment; it just groups the data based on the independent variables. Let's follow your code using the iris data set, which has only 3 groups, instead of the 9 clusters that your code requests.

library(caret)
library(clue)
set.seed(42)
data(iris)
Class <- iris$Species
iris.z <- scale(iris[, -5])
iris_split <- createDataPartition(iris$Species, times=1, p=0.8, list=FALSE)
iris_train <- iris.z[iris_split, ]
iris_test <- iris.z[-iris_split, ]
Class_train <- Class[iris_split]
Class_test <- Class[-iris_split]

This follows your code pretty closely with some minor adjustments. First, we load the necessary packages, caret and clue, which you did not include in your code. We set the seed for the random number generator because kmeans uses a random initial assignment, so the results can vary from one run to the next. Second, we scale all of the data so that the train and test data sets are on the same scale. Then we create train and test subsets of the scaled data and the group membership. Now the cluster analysis:

Clust_train <- kmeans(iris_train, centers=3, nstart=25)
table(Clust_train$cluster, Class_train)
#    Class_train
#     setosa versicolor virginica
#   1      0          9        31
#   2      0         31         9
#   3     40          0         0

Notice that there is no guarantee that the cluster numbers will match the original group designations. With 3 clusters/groups it is straightforward to recognize that cluster 1 is primarily composed of virginica, cluster 2 of versicolor, and cluster 3 of setosa. However, cluster 1 also includes 9 specimens of versicolor and cluster 2 also includes 9 specimens of virginica. With 9 clusters the results may not be as straightforward.
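With more clusters, reading the mapping off the table by eye gets tedious. Here is a minimal sketch of one way to derive it automatically, assuming each cluster should simply take the label of its most frequent class (a majority vote). It is self-contained and clusters the full iris data rather than the training split above, and the names km, tbl, cluster_to_class and pred are only illustrative:

```r
# Sketch: label each cluster with its most common class (majority vote).
set.seed(42)
iris_z <- scale(iris[, -5])
km <- kmeans(iris_z, centers = 3, nstart = 25)

# Rows are cluster ids, columns are species.
tbl <- table(km$cluster, iris$Species)

# For each cluster (row), pick the species with the largest count.
# which.max() returns the first maximum, so ties go to the earlier level.
cluster_to_class <- colnames(tbl)[apply(tbl, 1, which.max)]
names(cluster_to_class) <- rownames(tbl)

# Translate every cluster id into a class label with the original levels.
pred <- factor(cluster_to_class[as.character(km$cluster)],
               levels = levels(iris$Species))
table(pred, iris$Species)
```

Nothing forces two clusters to get different labels under this rule, so with many clusters and few classes several clusters will share a label, and a rare class may end up with no cluster at all.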

The next step converts the cluster number to the most likely species designation:

train_pre <- factor(ifelse(Clust_train$cluster==1, 3, ifelse(Clust_train$cluster==2, 2, 1)), labels=levels(iris$Species))
tbl_train <- table(train_pre, Class_train)
tbl_train
#             Class_train
# train_pre    setosa versicolor virginica
#   setosa         40          0         0
#   versicolor      0         31         9
#   virginica       0          9        31
sum(diag(tbl_train))/sum(tbl_train) * 100
# [1] 85

So the training cluster analysis was about 85% accurate in grouping specimens of the same species together. Now assign the test group to clusters:

pre_model <- cl_predict(Clust_train, iris_test)
test_pre <- factor(ifelse(pre_model==1, 3, ifelse(pre_model==2, 2, 1)), labels=levels(iris$Species))
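# confusionMatrix() takes the predictions first and the reference second, so in
# this call the rows labelled "Prediction" hold the true species. positive='1'
# is dropped because it is not a level of these three-class factors.
confusionMatrix(Class_test,  test_pre)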
# Confusion Matrix and Statistics
# 
#             Reference
# Prediction   setosa versicolor virginica
#   setosa         10          0         0
#   versicolor      0          8         2
#   virginica       0          5         5
#           .  .  .
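
Applied to the original question, the same idea gives predictions with the levels 0 and 1 that confusionMatrix expects. Below is a rough sketch on simulated data (creditcard.csv is not used here). The data, the helper nearest_center, and all variable names are assumptions for illustration, and the nearest-center assignment is a base-R stand-in for the cl_predict step, not a claim about how clue implements it:

```r
# Sketch on simulated two-class data; all names here are hypothetical.
set.seed(1)
n <- 600
Class <- factor(rbinom(n, 1, 0.2), levels = c("0", "1"))
X <- matrix(rnorm(n * 4), ncol = 4) + 3 * (Class == "1")  # shift class 1 rows

idx <- sample(n, round(0.8 * n))
# Scale the training rows, then reuse their center/scale for the test rows.
X_train <- scale(X[idx, ])
X_test <- scale(X[-idx, ],
                center = attr(X_train, "scaled:center"),
                scale  = attr(X_train, "scaled:scale"))

km <- kmeans(X_train, centers = 9, iter.max = 50, nstart = 25)

# Majority class within each training cluster (rows are cluster ids 1..9).
tbl <- table(km$cluster, Class[idx])
cluster_to_class <- colnames(tbl)[apply(tbl, 1, which.max)]

# Assign each test row to the closest cluster center (Euclidean distance).
nearest_center <- function(x, centers) {
  which.min(colSums((t(centers) - x)^2))
}
test_cluster <- apply(X_test, 1, nearest_center, centers = km$centers)

# Map cluster ids to class labels, keeping the same levels as the reference.
test_pred <- factor(cluster_to_class[test_cluster], levels = levels(Class))
table(Predicted = test_pred, Actual = Class[-idx])
```

With caret loaded, confusionMatrix(test_pred, Class[-idx], positive = "1") should then run, because both factors share the levels 0 and 1. On data as imbalanced as the fraud set, it is possible that every cluster gets the majority label 0, in which case no observation is ever predicted as fraud.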