create new order for existing column values without reordering rows in dataframe - R

Time:06-10

I have cluster labels from kmeans runs done separately on different ids (reprex below). The problem is that the kmeans cluster codes are not ordered consistently across ids, although every id has 3 clusters.

reprex = data.frame(id = rep(1:2, each = 4), 
           v1 = rep(seq(1:4), 2),
           cluster = c(2,2,1,3,3,1,2,2))

reprex
   id v1 cluster
1  1  1       2
2  1  2       2
3  1  3       1
4  1  4       3
5  2  1       3
6  2  2       1
7  2  3       2
8  2  4       2

What I want is for the cluster variable to always start at 1 within each id, with new codes assigned in order of first appearance. Note that I don't want to reorder the dataframe by cluster; the row order needs to remain the same. So the desired result would be:

reprex_desired<- data.frame(id = rep(1:2, each = 4), 
           v1 = rep(seq(1:4), 2),
           cluster = c(2,2,1,3,3,1,2,2),
           what_iWant = c(1,1,2,3,1,2,3,3))

reprex_desired
  id v1 cluster what_iWant
1  1  1       2          1
2  1  2       2          1
3  1  3       1          2
4  1  4       3          3
5  2  1       3          1
6  2  2       1          2
7  2  3       2          3
8  2  4       2          3

Any pointers are very much appreciated.

CodePudding user response:

We can use match after grouping by 'id'

library(dplyr)
reprex <- reprex %>%
     group_by(id) %>% 
     mutate(what_IWant = match(cluster, unique(cluster))) %>%
     ungroup()

-output

reprex
# A tibble: 8 × 4
     id    v1 cluster what_IWant
  <int> <int>   <dbl>      <int>
1     1     1       2          1
2     1     2       2          1
3     1     3       1          2
4     1     4       3          3
5     2     1       3          1
6     2     2       1          2
7     2     3       2          3
8     2     4       2          3
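
The same match idea can also be sketched in base R, without dplyr. This is only an illustrative alternative, not part of the answer above; it rebuilds the example data from the question (using each = 4, as in reprex_desired) and assumes ave() is an acceptable way to apply the function within each id.

```r
# Rebuild the example data from the question.
reprex <- data.frame(id = rep(1:2, each = 4),
                     v1 = rep(1:4, 2),
                     cluster = c(2, 2, 1, 3, 3, 1, 2, 2))

# Within each id, relabel clusters by order of first appearance.
# Row order is left untouched.
reprex$what_iWant <- ave(reprex$cluster, reprex$id,
                         FUN = function(x) match(x, unique(x)))

reprex$what_iWant
```

ave() returns a vector in the original row order, so no sorting or re-joining step is needed.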

CodePudding user response:

Here is a version with cumsum combined with lag:

library(dplyr)
reprex %>% 
  group_by(id) %>% 
  mutate(what_i_want = cumsum(cluster != lag(cluster, default = first(cluster))) + 1)
     id    v1 cluster what_i_want
  <int> <int>   <dbl>       <dbl>
1     1     1       2           1
2     1     2       2           1
3     1     3       1           2
4     1     4       3           3
5     2     1       3           1
6     2     2       1           2
7     2     3       2           3
8     2     4       2           3
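
One caveat worth checking: the cumsum approach counts changes between consecutive rows, so it agrees with match only when each cluster label occupies one contiguous run of rows within an id, as it does in the example. A small base R sketch on a hypothetical vector (not from the question), where a label reappears later, shows the difference. The lag is emulated with head() to keep the snippet free of dplyr.

```r
# Hypothetical labels for a single id; label 2 reappears after label 1.
x <- c(2, 1, 2)

# First-appearance relabelling: repeated labels get the same code.
by_match <- match(x, unique(x))

# Run-based relabelling: every change from the previous row starts a new code.
prev <- c(x[1], head(x, -1))          # base R stand-in for lag(x, default = x[1])
by_cumsum <- cumsum(x != prev) + 1

by_match   # 1 2 1
by_cumsum  # 1 2 3
```

If the rows of one cluster can be interleaved with others within an id, the match version is probably the one that gives the intended labels.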