I am playing with this dataset (credit card fraud) and trying to train a prediction model with k-means clustering. But the pre_model result is labeled by cluster id. Therefore, when I tried to evaluate the performance of the model with confusionMatrix, it raised an error saying the testing data and the predicted result do not have the same levels. So, how can I convert the labels in the predicted result to match testdata$Class, which is 1 or 0 (fraud or not), instead of the cluster id? Thanks!
Code:
data = read.csv("creditcard.csv")
data$Class <- as.factor(data$Class)
data_split <- createDataPartition(data$Class, times=1, p=0.8, list=F)
train_data <- data[data_split ,]
testdata <- data[-data_split ,]
scaled_data <- scale(train_data[-c(31)])
scaled_data <- as.matrix(scaled_data)
clust <- kmeans(scaled_data, centers = 9, nstart = 25)
pre_model <- cl_predict(clust , testdata)
confusionMatrix(testdata$Class, pre_model , positive='1')
K-means Cluster Result:
> Clustering vector:
[1] 7 7 6 6 7 7 8 8 3 8 6 3 6 7 7 8 7 8 7 6 6 7 7 7 3 7 7 6 7 7 7 7 7 7 7 7 7 7 7 7 3 7 7 7 8 7
[47] 8 7 6 7 7 1 7 7 8 7 7 3 7 7 8 6 7 3 7 7 8 7 7 7 7 3 7 7 7 7 7 7 7 3 7 7 9 6 7 7 7 7 3 1 7 7
[93] 7 7 3 7 7 7 7 7 7 7 7 7 8 3 7 7 7 8 7 7 7 7 7 7 7 7 7 7 8 7 6 8 8 7 8 8 8 7 8 7 7 7 7 7 6 8
[139] 7 8 8 6 7 3 8 3 9 7 7 7 8 7 7 8 7 3 7 7 7 7 6 7 7 7 1 7 7 7 7 7 7 6 7 7 8 7 7 7 8 6 7 8 7 8....
Error in confusionMatrix:
Error: `data` and `reference` should be factors with the same levels.
Answer:
Your analysis is complicated by the fact that cluster analysis is an unsupervised approach. That is, it does not try to predict the original class labels; it just groups the data based on the independent variables. Let's follow your code using the iris data set, which has only 3 groups instead of the 9 clusters requested in your code.
library(caret)
library(clue)
set.seed(42)
data(iris)
Class <- iris$Species
iris.z <- scale(iris[, -5])
iris_split <- createDataPartition(iris$Species, times=1, p=0.8, list=FALSE)
iris_train <- iris.z[iris_split, ]
iris_test <- iris.z[-iris_split, ]
Class_train <- Class[iris_split]
Class_test <- Class[-iris_split]
This follows your code pretty closely with some minor adjustments. We load the necessary packages, caret and clue, which you did not include in your code. We set the seed for the random number generator because kmeans starts from randomly chosen cluster centers, so the results can vary from one run to the next. Next we scale all of the data so that the train and test sets are on the same scale. Then we create train and test subsets of the scaled data and of the group membership. Now the cluster analysis:
Clust_train <- kmeans(iris_train, centers=3, nstart=25)
table(Clust_train$cluster, Class_train)
#    Class_train
#     setosa versicolor virginica
#   1      0          9        31
#   2      0         31         9
#   3     40          0         0
Notice that there is no guarantee that the cluster numbers will match the original group designations. With 3 clusters/groups it is straightforward to recognize that cluster 1 is primarily composed of virginica, cluster 2 of versicolor, and cluster 3 of setosa. However, cluster 1 also includes 9 specimens of versicolor and cluster 2 also includes 9 specimens of virginica. With 9 clusters the results may not be as straightforward.
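With more clusters it gets tedious to read the mapping off the table by eye. One possible way to automate it, sketched here with base R only, is a majority vote: each cluster gets the class that is most frequent among its members. The names x, truth, fit, tab, majority and pred are made up for this illustration, and the sketch clusters the whole iris data rather than the training subset.

```r
# Sketch: derive the cluster-to-class mapping by majority vote (base R only).
set.seed(42)
x     <- scale(iris[, -5])   # scaled predictors
truth <- iris$Species        # known classes, used only to label the clusters
fit   <- kmeans(x, centers = 3, nstart = 25)

# Rows are clusters, columns are the known classes.
tab <- table(cluster = fit$cluster, class = truth)

# For each cluster (row), take the class with the largest count.
majority <- colnames(tab)[apply(tab, 1, which.max)]
names(majority) <- rownames(tab)

# Translate cluster ids into class labels, keeping the reference levels.
pred <- factor(majority[as.character(fit$cluster)], levels = levels(truth))

accuracy <- mean(pred == truth)
```

Several clusters can map to the same class (with 9 clusters and 2 classes they must), and if one class is rare, every cluster may end up mapped to the common class, so it is worth inspecting tab before trusting the mapping.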
For the three iris clusters, the next step converts the cluster number to the most likely species designation:
train_pre <- factor(ifelse(Clust_train$cluster==1, 3, ifelse(Clust_train$cluster==2, 2, 1)), labels=levels(iris$Species))
tbl_train <- table(train_pre, Class_train)
tbl_train
#             Class_train
# train_pre    setosa versicolor virginica
#   setosa         40          0         0
#   versicolor      0         31         9
#   virginica       0          9        31
sum(diag(tbl_train))/sum(tbl_train) * 100
# [1] 85
So the training cluster analysis was about 85% accurate in grouping specimens of the same species together. Now assign the test group to clusters:
pre_model <- cl_predict(Clust_train, iris_test)
test_pre <- factor(ifelse(pre_model==1, 3, ifelse(pre_model==2, 2, 1)), labels=levels(iris$Species))
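# caret expects the predictions first and the reference (true classes) second;
# `positive` only applies to two-class problems, so it is dropped here
confusionMatrix(test_pre, Class_test)
# Confusion Matrix and Statistics
#
#             Reference
# Prediction   setosa versicolor virginica
#   setosa         10          0         0
#   versicolor      0          8         5
#   virginica       0          2         5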
# . . .
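For your fraud data the same idea applies: the error goes away once the cluster ids have been translated into a factor that has exactly the levels of testdata$Class. Here is a minimal sketch with invented toy values; truth, cluster_id and the cluster_to_class lookup are all hypothetical and not taken from your data.

```r
# Toy values, invented for illustration only.
truth      <- factor(c(0, 0, 1, 0, 1, 0))   # stands in for testdata$Class
cluster_id <- c(7, 7, 3, 6, 3, 3)           # stands in for the predicted cluster ids

# Hypothetical lookup from cluster id to class label.
cluster_to_class <- c("3" = "1", "6" = "0", "7" = "0")

# Index the lookup by cluster id, then force the same levels as the reference.
pred <- factor(cluster_to_class[as.character(cluster_id)],
               levels = levels(truth))

# Both factors now share the levels "0" and "1".
stopifnot(identical(levels(pred), levels(truth)))
table(Prediction = pred, Reference = truth)
```

With caret loaded, confusionMatrix(pred, truth, positive = "1") should then accept both arguments, because they are factors with the same levels.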