I am following this tutorial and this is all super new to me, so apologies if it's an obvious question.
Following the tutorial, I have converted a list of strings, sentences, into a K-Means model with 4 clusters using the following:
corpus <- tm::Corpus(tm::VectorSource(sentences))
# corpus.cleaned: presumably the corpus after the tutorial's cleaning steps (not shown here)
tdm <- tm::DocumentTermMatrix(corpus.cleaned)
tdm.tfidf <- tm::weightTfIdf(tdm)
tdm.tfidf <- tm::removeSparseTerms(tdm.tfidf, 0.999)
tfidf.matrix <- as.matrix(tdm.tfidf)
dist.matrix = proxy::dist(tfidf.matrix, method = "cosine")
model <- kmeans(dist.matrix, centers = 4)
Now, I would like to go back to the original list of sentences and show next to each one which cluster it belongs to. For example:
sentences | cluster |
---|---|
Lorem ipsum dolor sit amet | 1 |
Consectetur adipiscing elit | 2 |
I've tried the following (using the dplyr package):
clustered <- mutate(sentences, cluster = model$cluster)
and
clustered <- mutate(df$sentences, cluster = model$cluster)
But obviously this doesn't work because, as R says: no applicable method for 'mutate' applied to an object of class "character".
Any ideas?
CodePudding user response:
Without data to test this, and if I have understood correctly: sentences is a character vector (a list of strings), which you can use as a column in a new data frame, and model$cluster is an integer vector with one entry per input row, in the same order as the input. So, as long as the order of the sentences was preserved through the pipeline, the two line up position by position. If that holds (I can't confirm it, because I have never used the tm library), you can just create a new data frame from the two vectors.
# Pair each sentence with its cluster label, position by position
kmeans_results <- data.frame(
  sentence = sentences,
  clusterID = model$cluster
)
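To check the "same order, same length" assumption before combining, here is a minimal sketch on toy data. The four sentences and the small features matrix are made up for illustration and stand in for the tf-idf distance matrix from the question; only base R is used.

```r
# Toy stand-ins (assumptions, not the asker's data): four sentences and a
# small numeric matrix with one row per sentence.
sentences <- c("Lorem ipsum dolor sit amet",
               "Consectetur adipiscing elit",
               "Sed do eiusmod tempor",
               "Ut labore et dolore magna")
features <- matrix(c( 0,  0,
                      0,  1,
                     10, 10,
                     10, 11), ncol = 2, byrow = TRUE)

set.seed(42)  # kmeans picks its starting centers at random
model <- kmeans(features, centers = 2)

# kmeans returns one cluster label per input row, in input order,
# so the lengths must match before the two vectors are paired up.
stopifnot(length(model$cluster) == length(sentences))

clustered <- data.frame(
  sentences = sentences,
  cluster = model$cluster,
  stringsAsFactors = FALSE
)
print(clustered)
```

If the length check fails on real data, some sentences were probably dropped somewhere in the pipeline, and the labels can no longer be matched by position. The mutate error in the question comes from passing a character vector where a data frame is expected; wrapping it first, as in mutate(data.frame(sentences = sentences), cluster = model$cluster), should give the same table.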