How to use dplyr distinct function with multiple data frame variables and when there are ties?-CodePudding

I'm using dplyr distinct() for the first time and I'm trying to figure out how to use it with multiple variables and how to handle "ties". For example, when I run the code shown at the bottom of this post against example data frame label_18, I get the below correct results as shown and explained here (note that there no ties with eleCnt and grpID columns in this example):

  Element Group   eleCnt   grpID grpRnk  Explain grpRnk column...
  <chr>   <dbl>    <int>   <int>  <int>
1 B           2        1       3      1  Ranked 1st since it has lowest eleCnt & lowest grpID
2 R           3        1       6      2  Ranked 2nd since it has lowest elecCnt & 2nd lowest grpID
3 X           4        1      10      3  Same pattern as above
4 R           1        4       9      4  Same pattern as above  
5 R           2        6      13      5  Same pattern as above

Now when I run the code against label_7, there is a tie between eleCnt and grpID, and I get these results:

  Element Group   eleCnt   grpID grpRnk
  <chr>   <dbl>    <int>   <int>  <int>
1 R           1        1       3      1
2 R           2        3       7      2

Expected output: I would like the results for label_7 to be (while retaining the output for label_18 shown above):

  Element Group   eleCnt   grpID grpRnk   Explain grpRnk column...
  <chr>   <dbl>    <int>   <int>  <int>
1 R           1        1       3      1   Ranked 1st since it has lowest eleCnt & lowest grpID
2 X           3        1       3      1   Also ranked 1st since it ties with above
3 R           2        3       7      2   Ranked 2nd since its eleCnt is 2nd and its grpRnk is 2nd

How do I modify distinct() for handling ties, so I can get the desired results for label_7 while keeping the same results for label_18? Maybe there's a better way to do this completely, some function other than distinct() for this sort of thing.

Code:

library(dplyr)

label_7 <- data.frame(Element=c("B","R","R","R","R","B","X","X","X","X","X"),
                      Group = c(0,1,1,2,2,0,3,3,0,0,0),
                      eleCnt = c(1,1,2,3,4,2,1,2,3,4,5),
                      grpID = c(0,3,3,7,7,0,3,3,0,0,0))

label_18 <- data.frame(Element = c("R","R","R","X","X","X","X","B","B","R","R","R","R"),
                       Group = c(3,3,3,4,4,4,4,2,2,1,1,2,2),
                       eleCnt = c(1,2,3,1,2,3,4,1,2,4,5,6,7),
                       grpID = c(6,6,6,10,10,10,10,3,3,9,9,13,13))

label_7 %>% select(Element,Group,eleCnt,grpID) %>% 
  filter(Group > 0) %>% 
  group_by(Element,Group) %>% 
  slice(which.min(Group)) %>% 
  ungroup() %>%
  distinct(eleCnt,grpID, .keep_all = TRUE) %>%
  arrange(eleCnt,grpID) %>%
  mutate(grpRnk = 1:n())

Edit: adding another data frame to test against, label_15 --

> label_15
  Element Group eleCnt grpID
1       B     0      1     0
2       R     1      1     3
3       R     1      2     3
4       R     0      3     0
5       X     2      1     3
6       X     2      2     3
7       X     3      3     7
8       X     3      4     7

Expected results would be similar to label_7, because of a tie between Elements R and X in rows 2 and 5 of the above data frame:

  Element Group eleCnt grpID grpRank
  <chr>   <dbl>  <dbl> <dbl>   <int>
1 R           1      1     3       1
2 X           2      1     3       1
3 X           3      3     7       2

Code for label_15 data frame:

label_15 <- data.frame(Element = c("B","R","R","R","X","X","X","X"),
                       Group = c(0,1,1,0,2,2,3,3),
                       eleCnt = c(1,1,2,3,1,2,3,4),
                       grpID = c(0,3,3,0,3,3,7,7))

CodePudding user response：

We could try

library(dplyr)
library(data.table)
label_7 %>% 
  select(Element,Group,eleCnt,grpID) %>% 
  filter(Group > 0) %>% 
  group_by(Element,Group) %>% 
  slice(which.min(Group)) %>% 
  ungroup() %>% 
  distinct(tmp = rleid(eleCnt, grpID), .keep_all = TRUE) %>%
  arrange(eleCnt,grpID) %>% 
  select(-tmp) %>%
  mutate(grpRank= match(grpID, unique(grpID)))

-output

# A tibble: 3 × 5
  Element Group eleCnt grpID grpRank
  <chr>   <dbl>  <dbl> <dbl>   <int>
1 R           1      1     3       1
2 X           3      1     3       1
3 R           2      3     7       2

For the second case

 label_18 %>%
   select(Element,Group,eleCnt,grpID) %>% 
  filter(Group > 0) %>% 
  group_by(Element,Group) %>% 
  slice(which.min(Group)) %>% 
  ungroup() %>% 
  distinct(tmp = rleid(eleCnt, grpID), .keep_all = TRUE) %>%
  arrange(eleCnt,grpID) %>% 
  select(-tmp) %>%
  mutate(grpRank= match(grpID, unique(grpID)))

-output

# A tibble: 5 × 5
  Element Group eleCnt grpID grpRank
  <chr>   <dbl>  <dbl> <dbl>   <int>
1 B           2      1     3       1
2 R           3      1     6       2
3 X           4      1    10       3
4 R           1      4     9       4
5 R           2      6    13       5