I have the following dataset, where the column clust is the initial cluster and lt_clust is the resulting cluster after some time:
dataset <- data.frame(Id = c(101, 102, 103, 104, 105, 106, 107, 108,
                             109, 110, 111, 112, 113, 114),
                      clust = c("k1", "k1", "k1", "k1", "k1", "k2", "k2",
                                "k2", "k2", "k2", "k3", "k3", "k3", "k3"),
                      lt_clust = c("k2", "k1", "k1", "k1", "k1", "k2", "k3",
                                   "k1", "k2", "k2", "k3", "k3", "k1", "k3"),
                      stringsAsFactors = FALSE)
Now I want to measure how accurate the initial assignment was: for each initial cluster, the share of rows that ended up in each final cluster. The expected result is:
  clust lt_clust  rate
  <fct> <fct>    <dbl>
1 k1    k1        0.8
2 k1    k2        0.2
3 k1    k3        0
4 k2    k1        0.2
5 k2    k2        0.6
6 k2    k3        0.2
7 k3    k1        0.25
8 k3    k2        0
9 k3    k3        0.75
This was my first attempt:
dataset %>%
  mutate(clust = as.factor(clust),
         lt_clust = as.factor(lt_clust),
         tick = 1) %>%
  group_by(clust, lt_clust, .drop = FALSE) %>%
  summarise(total = sum(tick)) %>%
  ungroup() %>%
  group_by(clust) %>%
  summarise(rate = total / sum(total))
But this drops the lt_clust column:
  clust  rate
  <fct> <dbl>
1 k1     0.8
2 k1     0.2
3 k1     0
4 k2     0.2
5 k2     0.6
6 k2     0.2
7 k3     0.25
8 k3     0
9 k3     0.75
And when I try this, the result is wrong too:
dataset %>%
  mutate(clust = as.factor(clust),
         lt_clust = as.factor(lt_clust),
         tick = 1) %>%
  group_by(clust, lt_clust, .drop = FALSE) %>%
  summarise(total = sum(tick),
            rate = total / sum(total))
  clust lt_clust total  rate
  <fct> <fct>    <dbl> <dbl>
1 k1    k1           4     1
2 k1    k2           1     1
3 k1    k3           0   NaN
4 k2    k1           1     1
5 k2    k2           3     1
6 k2    k3           1     1
7 k3    k1           1     1
8 k3    k2           0   NaN
9 k3    k3           3     1
Could you help me spot what I am doing wrong in the code? I am trying to do this with the dplyr package.
Answer:
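Your second attempt returns 1 (or NaN for empty combinations) because the data are still grouped by both clust and lt_clust, so sum(total) is just total within each group. Starting from your first attempt, add lt_clust on its own to the final summarise() so the column is kept: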
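library(dplyr)

dataset %>%
  mutate(clust = as.factor(clust),
         lt_clust = as.factor(lt_clust),
         tick = 1) %>%
  # count every clust/lt_clust pair, keeping empty combinations
  group_by(clust, lt_clust, .drop = FALSE) %>%
  summarise(total = sum(tick)) %>%
  ungroup() %>%
  # regroup by the initial cluster only, so sum(total) is the cluster size
  group_by(clust) %>%
  summarise(lt_clust, rate = total / sum(total))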
# A tibble: 9 × 3
# Groups:   clust [3]
  clust lt_clust  rate
  <fct> <fct>    <dbl>
1 k1    k1        0.8
2 k1    k2        0.2
3 k1    k3        0
4 k2    k1        0.2
5 k2    k2        0.6
6 k2    k3        0.2
7 k3    k1        0.25
8 k3    k2        0
9 k3    k3        0.75
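As an alternative, here is a minimal sketch that gets the same rates with count() and mutate() instead of a multi-row summarise(). As far as I know, dplyr 1.1.0 and later emit a deprecation warning when summarise() returns more than one row per group, and mutate() avoids that. The object name rates and the column name total are only illustrative choices, and the dataset is repeated so the snippet runs on its own.

```r
library(dplyr)

# Same data as in the question, repeated so the example is self-contained
dataset <- data.frame(
  Id = c(101, 102, 103, 104, 105, 106, 107, 108,
         109, 110, 111, 112, 113, 114),
  clust = c("k1", "k1", "k1", "k1", "k1", "k2", "k2",
            "k2", "k2", "k2", "k3", "k3", "k3", "k3"),
  lt_clust = c("k2", "k1", "k1", "k1", "k1", "k2", "k3",
               "k1", "k2", "k2", "k3", "k3", "k1", "k3"),
  stringsAsFactors = FALSE
)

rates <- dataset %>%
  # factors are needed so that .drop = FALSE can keep empty combinations
  mutate(clust = factor(clust),
         lt_clust = factor(lt_clust)) %>%
  # one row per clust/lt_clust pair, including pairs with zero rows
  count(clust, lt_clust, .drop = FALSE, name = "total") %>%
  # within each initial cluster, divide by that cluster's size
  group_by(clust) %>%
  mutate(rate = total / sum(total)) %>%
  ungroup() %>%
  select(-total)

print(rates)
```

Because mutate() keeps every row and column, lt_clust is carried through without having to be listed again, and the final ungroup() returns a plain tibble rather than one grouped by clust.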