I am using the iris dataset in R. I clustered the data with k-means; the output is stored in the variable km.out. However, I cannot find an easy way to map the cluster numbers (1-3) to the species (versicolor, setosa, virginica). I wrote a manual approach, but it requires setting the seed and is tedious. There has to be a better way to do it. Any thoughts?
Here is what I did manually:
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 1) {
    km.out$cluster[i] = "versicolor"
  }
}
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 2) {
    km.out$cluster[i] = "setosa"
  }
}
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 3) {
    km.out$cluster[i] = "virginica"
  }
}
CodePudding user response:
R is a vectorized language, so the following one-liner is equivalent to the code in the question:
km.out$cluster <- c("versicolor", "setosa", "virginica")[km.out$cluster]
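The one-liner relies on indexing a character vector with an integer vector, which returns one label per index. A minimal sketch of that mechanism, where `labels` and `ids` are made-up toy vectors for illustration (not taken from the question):

```r
# Hypothetical lookup vector: position i holds the label for cluster i
labels <- c("versicolor", "setosa", "virginica")

# Hypothetical cluster assignments, standing in for km.out$cluster
ids <- c(2L, 2L, 1L, 3L, 1L)

# Indexing by the integer vector returns one label per element of ids
recoded <- labels[ids]
recoded
```

Note that this only works while the cluster vector is still numeric; once it has been overwritten with species names, it can no longer be used as an index.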
CodePudding user response:
It is not clear what you are trying to accomplish. The clusters created by kmeans will not match the Species exactly and there is no guarantee that clusters 1, 2, 3 will match the order of the species in iris. Also, as you noted, the results will vary depending on the value of the seed. For example,
set.seed(42)
iris.km <- kmeans(scale(iris[, -5]), 3)
table(iris.km$cluster, iris$Species)
#
#     setosa versicolor virginica
#   1     50          0         0
#   2      0         39        14
#   3      0         11        36
Cluster 1 is exactly associated with setosa, but cluster 2 combines versicolor and virginica as does cluster 3.
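One way to reduce (though not remove) the dependence on the seed is the `nstart` argument of `kmeans()`, which runs several random starts and keeps the best solution. The sketch below is a suggestion rather than part of the original answer, and the value 25 is an arbitrary assumption. The cluster numbering is still arbitrary, so the labels 1, 2, 3 may still land on different groups from run to run.

```r
set.seed(42)

# nstart = 25 is an arbitrary choice: kmeans() tries 25 random sets of
# starting centers and keeps the solution with the lowest total
# within-cluster sum of squares
iris.km <- kmeans(scale(iris[, -5]), centers = 3, nstart = 25)

# Cross-tabulate the clusters against the known species
tab <- table(cluster = iris.km$cluster, Species = iris$Species)
tab
```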
CodePudding user response:
You can recode the cluster number and add it back to the original data with:
library(dplyr)
mutate(iris,
       cluster = case_when(km.out$cluster == 1 ~ "versicolor",
                           km.out$cluster == 2 ~ "setosa",
                           km.out$cluster == 3 ~ "virginica"))
Alternatively, you can recode the vector with a vector-translation approach using elucidate::translate():
remotes::install_github("bcgov/elucidate") #if elucidate isn't installed yet
library(dplyr)
library(elucidate)
mutate(iris,
       cluster = translate(km.out$cluster,
                           old = c(1:3),
                           new = c("versicolor",
                                   "setosa",
                                   "virginica")))
CodePudding user response:
If you want to assign the cluster numbers (1-3) to a species (versicolor, setosa, virginica), you will likely not get a 1:1 correspondence. But you could assign the most frequent species in each cluster like this:
data(iris)
# k-means clustering
set.seed(5834)
km.out <- kmeans(iris[,1:4], centers = 3)
# associate species with clusters
(cmat <- table(Species = iris[,5], cluster = km.out$cluster))
#>             cluster
#> Species       1  2  3
#>   setosa     33 17  0
#>   versicolor  0  4 46
#>   virginica   0  0 50
# find the most-frequent species in each cluster
setNames(rownames(cmat)[apply(cmat, 2, which.max)], colnames(cmat))
#>           1           2           3
#>    "setosa"    "setosa" "virginica"
# find the most-frequent assigned cluster per species
setNames(colnames(cmat)[apply(cmat, 1, which.max)], rownames(cmat))
#>     setosa versicolor  virginica
#>        "1"        "3"        "3"
Created on 2021-09-22 by the reprex package (v2.0.1)
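To carry the majority-species mapping back to every observation, the cluster-to-species lookup can be indexed by each row's cluster number. This is a sketch building on the code above; the names `cluster_species`, `predicted` and `agreement` are introduced here for illustration. Because two clusters can share the same majority species, some species may never be assigned.

```r
data(iris)
set.seed(5834)
km.out <- kmeans(iris[, 1:4], centers = 3)

# Rows are species, columns are clusters
cmat <- table(Species = iris$Species, cluster = km.out$cluster)

# Majority species per cluster, named by cluster number
cluster_species <- setNames(rownames(cmat)[apply(cmat, 2, which.max)],
                            colnames(cmat))

# Look up each observation's cluster by name to get a species label
predicted <- unname(cluster_species[as.character(km.out$cluster)])

# Share of observations whose assigned label matches the true species
agreement <- mean(predicted == iris$Species)
agreement
```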