Naming clusters in R

Time:09-22

I am using the iris dataset in R. I clustered the data with k-means; the result is stored in the variable km.out. However, I cannot find an easy way to map the cluster numbers (1-3) to species names (versicolor, setosa, virginica). My current approach is manual, and it only works if I set the seed so that the cluster numbering stays the same between runs. There has to be a better way to do it. Any thoughts?

Here is what I did manually:

for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 1) {
    km.out$cluster[i] = "versicolor"
  }
}
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 2) {
    km.out$cluster[i] = "setosa"
  }
}
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 3) {
    km.out$cluster[i] = "virginica"
  }
}

CodePudding user response:

R is a vectorized language, so the following one-liner is equivalent to the loops in the question:

km.out$cluster <- c("versicolor", "setosa", "virginica")[km.out$cluster]
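To see why the one-liner works, here is a minimal sketch on a small made-up vector. The names cluster_ids, species_labels, named_clusters and named_factor are only for illustration and stand in for km.out$cluster and the labels from the question. It also shows factor() as one possible alternative that keeps the set of labels explicit.

```r
# Hypothetical cluster assignments standing in for km.out$cluster
cluster_ids <- c(2L, 1L, 3L, 3L, 1L, 2L)

# Labels for clusters 1, 2 and 3, in that order (same mapping as the question)
species_labels <- c("versicolor", "setosa", "virginica")

# Integer indexing picks one label per element, so no loop is needed
named_clusters <- species_labels[cluster_ids]
print(named_clusters)

# factor() produces the same labels and stores the full set of levels
named_factor <- factor(cluster_ids, levels = 1:3, labels = species_labels)
print(named_factor)
```

Either form still relies on knowing which cluster number goes with which species, so the mapping itself remains tied to the seed.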

CodePudding user response:

It is not clear what you are trying to accomplish. The clusters created by kmeans() will not match Species exactly, and there is no guarantee that clusters 1, 2, 3 will follow the order of the species in iris. Also, as you noted, the cluster numbering will vary depending on the value of the seed. For example,

set.seed(42)
iris.km <- kmeans(scale(iris[, -5]), 3)
table(iris.km$cluster, iris$Species)
#    
#     setosa versicolor virginica
#   1     50          0         0
#   2      0         39        14
#   3      0         11        36

Cluster 1 corresponds exactly to setosa, but clusters 2 and 3 each combine versicolor and virginica.
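One way to put a number on how well the clusters line up with the species is the share of points in each cluster that belong to its most common species (often called purity). The sketch below reuses the same kmeans() call as above; purity_by_cluster and overall_purity are names made up for this example, and purity is just one of several possible summaries.

```r
set.seed(42)
iris.km <- kmeans(scale(iris[, -5]), 3)

# Rows are clusters, columns are species
tab <- table(iris.km$cluster, iris$Species)

# Share of each cluster taken up by its most common species
purity_by_cluster <- apply(tab, 1, max) / rowSums(tab)
print(round(purity_by_cluster, 2))

# Share of all observations that fall in their cluster's majority species
overall_purity <- sum(apply(tab, 1, max)) / sum(tab)
print(round(overall_purity, 2))
```

A value of 1 for a cluster means it contains a single species; lower values mean the cluster mixes species, so any single name for it is only approximate.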

CodePudding user response:

You can recode the cluster number and add it back to the original data with:

library(dplyr)
mutate(iris, 
       cluster = case_when(km.out$cluster == 1 ~ "versicolor",
                           km.out$cluster == 2 ~ "setosa",
                           km.out$cluster == 3 ~ "virginica"))

Alternatively, you can recode the vector with elucidate::translate(), which maps a vector of old values to a vector of new values:

remotes::install_github("bcgov/elucidate") #if elucidate isn't installed yet
library(dplyr)
library(elucidate)

mutate(iris, 
       cluster = translate(km.out$cluster, 
                           old = c(1:3), 
                           new =  c("versicolor", 
                                    "setosa", 
                                    "virginica")))

CodePudding user response:

If you want to assign the cluster numbers (1-3) to a species (versicolor, setosa, virginica), you will likely not have a 1:1 correspondence. But you could assign the most frequent species in each cluster, like this:

data(iris)

# k-means clustering
set.seed(5834)
km.out <- kmeans(iris[,1:4], centers = 3)

# associate species with clusters
(cmat <- table(Species = iris[,5], cluster = km.out$cluster))
#>             cluster
#> Species       1  2  3
#>   setosa     33 17  0
#>   versicolor  0  4 46
#>   virginica   0  0 50

# find the most-frequent species in each cluster
setNames(rownames(cmat)[apply(cmat, 2, which.max)], colnames(cmat))
#>           1           2           3 
#>    "setosa"    "setosa" "virginica"

# find the most-frequent assigned cluster per species
setNames(colnames(cmat)[apply(cmat, 1, which.max)], rownames(cmat))
#>     setosa versicolor  virginica 
#>        "1"        "3"        "3"

Created on 2021-09-22 by the reprex package (v2.0.1)
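Building on the table above, the majority-species lookup can be applied back to every observation, so the names are derived from the data rather than hard-coded per seed. This is only a sketch: cluster_to_species and the cluster_species column are names invented here, and the approach assumes the true species are available to vote with.

```r
data(iris)

set.seed(5834)  # any seed works; the mapping is recomputed from the result
km.out <- kmeans(iris[, 1:4], centers = 3)

# Species by cluster contingency table, as in the answer above
cmat <- table(Species = iris$Species, cluster = km.out$cluster)

# Most frequent species in each cluster, named by cluster number
cluster_to_species <- setNames(rownames(cmat)[apply(cmat, 2, which.max)],
                               colnames(cmat))

# Look up each observation's cluster by name and store the species label
iris$cluster_species <- unname(cluster_to_species[as.character(km.out$cluster)])

# Flag the case where two clusters end up with the same majority species
if (anyDuplicated(cluster_to_species) > 0) {
  message("Some clusters share a majority species; the labels are not unique.")
}

# Compare the true species with the label assigned through the cluster
print(table(Species = iris$Species, assigned = iris$cluster_species))
```

Because the lookup is keyed by cluster number as a character string, it does not depend on which number kmeans() happens to give each cluster on a particular run.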
