How to keep ties when using dplyr distinct function?

Time:09-26

I'm using dplyr distinct() with multiple variables and am trying to figure out how to handle "ties". For example, when I run the code at the bottom of this post against the example data frame label_1, I'd like to get the results below, where rows that tie on both eleCnt and grpID share the same rank:

  Element Group   eleCnt   grpID grpRnk  Explain grpRnk column...
  <chr>   <dbl>    <int>   <int>  <int>
1 R           1        1       3      1  Ranked 1st since it has lowest eleCnt & lowest grpID
2 X           3        1       3      1  Also ranked 1st since it ties with above in terms of eleCnt and grpID
3 R           2        3       7      2  Ranked 2nd since its eleCnt and grpID are both 2nd lowest

When I run the code against data frame label_2, there are no ties and the code gives me this correct output:

  Element Group eleCnt grpID grpRnk 
  <chr>   <dbl>  <dbl> <dbl>  <int>
1 B           2      1     3      1
2 R           3      1     6      2
3 X           4      1    10      3
4 R           1      4     9      4
5 R           2      6    13      5

Any recommendations for an efficient way to do this, preferably in dplyr? Maybe distinct() isn't the right function to be using?

Code:

library(dplyr)

label_1 <- data.frame(Element=c("B","R","R","R","R","B","X","X","X","X","X"),
                      Group = c(0,1,1,2,2,0,3,3,0,0,0),
                      eleCnt = c(1,1,2,3,4,2,1,2,3,4,5),
                      grpID = c(0,3,3,7,7,0,3,3,0,0,0))

label_2 <- data.frame(Element = c("R","R","R","X","X","X","X","B","B","R","R","R","R"),
                       Group = c(3,3,3,4,4,4,4,2,2,1,1,2,2),
                       eleCnt = c(1,2,3,1,2,3,4,1,2,4,5,6,7),
                       grpID = c(6,6,6,10,10,10,10,3,3,9,9,13,13))

label_2 %>% select(Element,Group,eleCnt,grpID) %>% 
  filter(Group > 0) %>% 
  group_by(Element,Group) %>% 
  slice(which.min(Group)) %>% 
  ungroup() %>%
  distinct(eleCnt,grpID, .keep_all = TRUE) %>%
  arrange(eleCnt,grpID) %>%
  mutate(grpRnk = 1:n()) 

Answer:

Perhaps you can leverage the data.table::rleid() function, like this:

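f <- function(lab) {
  filter(lab, Group != 0) %>%                              # drop rows that belong to no group
    arrange(eleCnt, grpID) %>%                             # sort so tied rows sit next to each other
    mutate(grpRnk = data.table::rleid(eleCnt, grpID)) %>%  # one id per run of equal (eleCnt, grpID)
    group_by(grpID) %>%
    filter(grpRnk == min(grpRnk))                          # keep each grpID's lowest-ranked row(s)
}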
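
For anyone unfamiliar with rleid(), here is a small sketch of what it returns. The two vectors are the eleCnt and grpID values of the first five rows of label_1 after filtering and sorting; the variable name ids is only for illustration. rleid() starts a new id each time the combination of values changes from one row to the next, so on sorted data tied rows end up with the same id:

```r
# Values of the first five rows of label_1 once Group != 0 rows are kept
# and the data is sorted by eleCnt, then grpID.
eleCnt <- c(1, 1, 2, 2, 3)
grpID  <- c(3, 3, 3, 3, 7)

# A new id starts whenever (eleCnt, grpID) differs from the previous row.
ids <- data.table::rleid(eleCnt, grpID)
ids
# 1 1 2 2 3
```

Note that the ids are assigned over every row, before f() keeps only the first run per grpID.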

Apply f() to label_1

f(label_1)

# A tibble: 3 x 5
# Groups:   grpID [2]
  Element Group eleCnt grpID grpRnk
  <chr>   <dbl>  <dbl> <dbl>  <int>
1 R           1      1     3      1
2 X           3      1     3      1
3 R           2      3     7      3

Apply f() to label_2

f(label_2)

# A tibble: 5 x 5
# Groups:   grpID [5]
  Element Group eleCnt grpID grpRnk
  <chr>   <dbl>  <dbl> <dbl>  <int>
1 B           2      1     3      1
2 R           3      1     6      2
3 X           4      1    10      3
4 R           1      4     9      9
5 R           2      6    13     12
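
One thing to be aware of: the grpRnk values above are run ids counted over all rows, so they can skip numbers (3 rather than 2 for label_1; 9 and 12 for label_2). If you want consecutive ranks as in the question, a possible dplyr-only variant is sketched below. It assumes the intent of the original slice() step is "keep the row with the lowest eleCnt per Element/Group", and rank_groups() is just a made-up helper name, not a dplyr function. Instead of distinct(), it counts the distinct (eleCnt, grpID) pairs seen so far in the sorted data, which gives tied rows the same rank:

```r
library(dplyr)

label_1 <- data.frame(
  Element = c("B","R","R","R","R","B","X","X","X","X","X"),
  Group   = c(0,1,1,2,2,0,3,3,0,0,0),
  eleCnt  = c(1,1,2,3,4,2,1,2,3,4,5),
  grpID   = c(0,3,3,7,7,0,3,3,0,0,0)
)

# Hypothetical helper: one row per Element/Group, ranked with ties kept.
rank_groups <- function(lab) {
  lab %>%
    filter(Group > 0) %>%
    group_by(Element, Group) %>%
    # Assumption: the row with the lowest eleCnt represents the group.
    slice_min(eleCnt, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    arrange(eleCnt, grpID) %>%
    # On sorted data, the running count of first-seen (eleCnt, grpID)
    # pairs is a dense rank: ties share a rank and no numbers are skipped.
    mutate(grpRnk = cumsum(!duplicated(data.frame(eleCnt, grpID))))
}

res_1 <- rank_groups(label_1)
res_1$grpRnk
# 1 1 2
```

Applied to label_2 from the question, the same helper should give ranks 1 through 5, since that data has no ties.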