Mutating a list of strings to include K-means cluster number in R

Time:09-01

I am following this tutorial and this is all super new to me, so apologies if it's an obvious question.

Following the tutorial, I have fitted a K-means model with 4 clusters to a list of strings, sentences, using the following:

corpus = tm::Corpus(tm::VectorSource(sentences))
# corpus.cleaned comes from the tutorial's cleaning steps, which are not shown here
tdm <- tm::DocumentTermMatrix(corpus.cleaned)
tdm.tfidf <- tm::weightTfIdf(tdm)
tdm.tfidf <- tm::removeSparseTerms(tdm.tfidf, 0.999) 
tfidf.matrix <- as.matrix(tdm.tfidf) 
dist.matrix = proxy::dist(tfidf.matrix, method = "cosine")
model <- kmeans(dist.matrix, centers = 4)

Now I would like to go back to the original list of sentences and show, next to each one, which cluster it belongs to. For example:

sentences cluster
Lorem ipsum dolor sit amet 1
Consectetur adipiscing elit 2

I've tried the following (using the dplyr package):

clustered <- mutate(sentences, cluster = model$cluster)

and

clustered <- mutate(df$sentences, cluster = model$cluster)

But obviously this doesn't work because, as R says, "no applicable method for 'mutate' applied to an object of class 'character'".

Any ideas?

Answer:

I can't test this without data, but if I understand correctly, sentences is a character vector (the error message says as much) and model$cluster is an integer vector with one entry per input row, in the same order as the input. So as long as the order of the sentences was preserved through the pipeline, element i of model$cluster is the cluster of sentence i. If that holds (I can't confirm it, because I have never used the tm package), you can put the two vectors side by side in a new data frame. mutate fails only because it expects a data frame, not a bare character vector.

kmeans_results <- data.frame(
  sentence = sentences,
  clusterID = model$cluster  # one label per sentence, in input order
)
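
To check the idea without the tm pipeline, here is a minimal sketch in base R only. The four sentences and the hand-built word-count matrix are made-up stand-ins for the question's data, and it uses 2 clusters instead of 4 because the toy input has only four rows. The point it illustrates is that kmeans returns one cluster label per input row, in input order, so the labels can sit directly beside the sentences.

```r
# Toy stand-ins for the question's objects (hypothetical data, base R only).
sentences <- c(
  "the cat sat on the mat",
  "the cat chased the dog",
  "stocks fell as markets closed",
  "markets rallied and stocks rose"
)

# Simple bag-of-words count matrix: one row per sentence, one column per term.
tokens <- strsplit(sentences, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))
counts <- t(vapply(
  tokens,
  function(words) as.numeric(table(factor(words, levels = vocab))),
  numeric(length(vocab))
))

set.seed(42)  # kmeans starts from random centers, so fix the seed
model <- kmeans(counts, centers = 2, nstart = 10)

# model$cluster holds one label per input row, in input order.
clustered <- data.frame(
  sentence = sentences,
  cluster = model$cluster,
  stringsAsFactors = FALSE
)
print(clustered)
```

Once the sentences are in a data frame like this, dplyr::mutate can also be used to add or change columns, since it then receives a data frame rather than a character vector.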