I have some results cluster labels from kmeans done on different ids (reprex example below). the problem is the kmeans clusters codes are not ordered consistently across ids although all ids have 3 clusters.
reprex = data.frame(id = rep(1:2, each = 41,
v1 = rep(seq(1:4), 2),
cluster = c(2,2,1,3,3,1,2,2))
reprex
id v1 cluster
1 1 1 2
2 1 2 2
3 1 3 1
4 1 4 3
5 2 1 3
6 2 2 1
7 2 3 2
8 2 4 2
what I want is that the variable cluster should always start with 1 within each ID. Note I don't want to reorder that dataframe by cluster, the order needs to remain the same. so the desired result would be:
reprex_desired<- data.frame(id = rep(1:2, each = 4),
v1 = rep(seq(1:4), 2),
cluster = c(2,2,1,3,3,1,2,2),
what_iWant = c(1,1,2,3,1,2,3,3))
reprex_desired
id v1 cluster what_iWant
1 1 1 2 1
2 1 2 2 1
3 1 3 1 2
4 1 4 3 3
5 2 1 3 1
6 2 2 1 2
7 2 3 2 3
8 2 4 2 3
any pointers are very much appreciated
CodePudding user response:
We can use match
after grouping by 'id'
library(dplyr)
reprex <- reprex %>%
group_by(id) %>%
mutate(what_IWant = match(cluster, unique(cluster))) %>%
ungroup
-output
reprex
# A tibble: 8 × 4
id v1 cluster what_IWant
<int> <int> <dbl> <int>
1 1 1 2 1
2 1 2 2 1
3 1 3 1 2
4 1 4 3 3
5 2 1 3 1
6 2 2 1 2
7 2 3 2 3
8 2 4 2 3
CodePudding user response:
Here is a version with cumsum
combined with lag
:
library(dplyr)
df %>%
group_by(id) %>%
mutate(what_i_want = cumsum(cluster != lag(cluster, def = first(cluster))) 1)
id v1 cluster what_i_want
<int> <int> <dbl> <dbl>
1 1 1 2 1
2 1 2 2 1
3 1 3 1 2
4 1 4 3 3
5 2 1 3 1
6 2 2 1 2
7 2 3 2 3
8 2 4 2 3