Transition matrix for cluster analysis in R

Time:12-07

I have the following dataset, where the column clust is the initial cluster and lt_clust is the resulting cluster after some time:

dataset <- data.frame(Id = c(101, 102, 103, 104, 105, 106, 107, 108, 
                             109, 110, 111, 112, 113, 114), 
                      clust = c("k1", "k1", "k1", "k1","k1", "k2", "k2", 
                                "k2", "k2", "k2", "k3", "k3", "k3", "k3"), 
                      lt_clust = c("k2", "k1", "k1", "k1", "k1", "k2", "k3", 
                                   "k1", "k2", "k2", "k3", "k3", "k1", "k3"),
                      stringsAsFactors = FALSE)

Now I want to measure how accurate my cluster assignment was, that is, the share of each initial cluster that ends up in each final cluster. The expected result is:

  clust lt_clust rate
  <fct> <fct>    <dbl>
1 k1    k1         0.8
2 k1    k2         0.2
3 k1    k3           0
4 k2    k1         0.2
5 k2    k2         0.6
6 k2    k3         0.2
7 k3    k1        0.25
8 k3    k2           0
9 k3    k3        0.75

This was my first attempt:

dataset %>% 
  mutate(clust = as.factor(clust),
         lt_clust = as.factor(lt_clust),
         tick = 1) %>%
  group_by(clust, lt_clust, .drop = FALSE) %>%
  summarise(total = sum(tick)) %>%
  ungroup() %>%
  group_by(clust, ) %>%
  summarise(rate = total / sum(total))

But the lt_clust column is missing from the result:

  clust  rate
  <fct> <dbl>
1 k1     0.8 
2 k1     0.2 
3 k1     0   
4 k2     0.2 
5 k2     0.6 
6 k2     0.2 
7 k3     0.25
8 k3     0   
9 k3     0.75

And when I try this, the result is wrong too:

dataset %>% 
  mutate(clust = as.factor(clust),
         lt_clust = as.factor(lt_clust),
         tick = 1) %>%
  group_by(clust, lt_clust, .drop = FALSE) %>%
  summarise(total = sum(tick),
            rate = total / sum(total))

  clust lt_clust total  rate
  <fct> <fct>    <dbl> <dbl>
1 k1    k1           4     1
2 k1    k2           1     1
3 k1    k3           0   NaN
4 k2    k1           1     1
5 k2    k2           3     1
6 k2    k3           1     1
7 k3    k1           1     1
8 k3    k2           0   NaN
9 k3    k3           3     1

Could you please help me spot what I am doing wrong in the code? I am trying to do this with the dplyr package.

CodePudding user response:

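In your first attempt, the final summarise() keeps only the grouping column (clust) and the columns it creates, so lt_clust is dropped. Naming lt_clust inside summarise() keeps it: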

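library(dplyr)

dataset %>% 
  mutate(clust = as.factor(clust),
         lt_clust = as.factor(lt_clust),
         tick = 1) %>%
  # .drop = FALSE keeps empty factor combinations such as k1 -> k3
  group_by(clust, lt_clust, .drop = FALSE) %>%
  summarise(total = sum(tick)) %>%
  ungroup() %>%
  group_by(clust) %>%
  # naming lt_clust keeps that column in the summarised result
  summarise(lt_clust, rate = total / sum(total))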

# A tibble: 9 × 3
# Groups:   clust [3]
  clust lt_clust  rate
  <fct> <fct>    <dbl>
1 k1    k1        0.8 
2 k1    k2        0.2 
3 k1    k3        0   
4 k2    k1        0.2 
5 k2    k2        0.6 
6 k2    k3        0.2 
7 k3    k1        0.25
8 k3    k2        0   
9 k3    k3        0.75
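
As for the second attempt: the data is still grouped by both clust and lt_clust when rate is computed, so each group holds a single total, and total / sum(total) can only be 1 (or 0 / 0 = NaN for the empty combinations). The rate has to be computed per clust only. A minimal sketch of one alternative, assuming a dplyr version where count() accepts .drop and name, is to count the pairs and then use mutate() instead of summarise(). This also avoids the deprecation warning that dplyr 1.1.0 and later raise when summarise() returns more than one row per group. The data is repeated from the question (without Id) so the block runs on its own; the names rates and transition are only illustrative.

```r
library(dplyr)

# Same clusters as in the question; factors so that .drop = FALSE
# can keep combinations that never occur (e.g. k1 -> k3).
dataset <- data.frame(
  clust    = c("k1", "k1", "k1", "k1", "k1", "k2", "k2",
               "k2", "k2", "k2", "k3", "k3", "k3", "k3"),
  lt_clust = c("k2", "k1", "k1", "k1", "k1", "k2", "k3",
               "k1", "k2", "k2", "k3", "k3", "k1", "k3"),
  stringsAsFactors = TRUE
)

rates <- dataset %>%
  # one row per (clust, lt_clust) pair, including zero counts
  count(clust, lt_clust, .drop = FALSE, name = "total") %>%
  # normalise within each initial cluster only
  group_by(clust) %>%
  mutate(rate = total / sum(total)) %>%
  ungroup() %>%
  select(clust, lt_clust, rate)

# The same rates as a matrix: rows are the initial cluster,
# columns the later cluster, and each row sums to 1.
transition <- prop.table(table(dataset$clust, dataset$lt_clust), margin = 1)
```

Printing rates should give the nine-row table from the question, and transition holds the same numbers in wide form, which is closer to a transition matrix in the usual sense.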