Home > OS >  How to use dplyr distinct function with multiple data frame variables and when there are ties?
How to use dplyr distinct function with multiple data frame variables and when there are ties?

Time:09-27

I'm using dplyr distinct() for the first time and I'm trying to figure out how to use it with multiple variables and how to handle "ties". For example, when I run the code shown at the bottom of this post against example data frame label_18, I get the below correct results as shown and explained here (note that there no ties with eleCnt and grpID columns in this example):

  Element Group   eleCnt   grpID grpRnk  Explain grpRnk column...
  <chr>   <dbl>    <int>   <int>  <int>
1 B           2        1       3      1  Ranked 1st since it has lowest eleCnt & lowest grpID
2 R           3        1       6      2  Ranked 2nd since it has lowest elecCnt & 2nd lowest grpID
3 X           4        1      10      3  Same pattern as above
4 R           1        4       9      4  Same pattern as above  
5 R           2        6      13      5  Same pattern as above

Now when I run the code against label_7, there is a tie between eleCnt and grpID, and I get these results:

  Element Group   eleCnt   grpID grpRnk
  <chr>   <dbl>    <int>   <int>  <int>
1 R           1        1       3      1
2 R           2        3       7      2

Expected output: I would like the results for label_7 to be (while retaining the output for label_18 shown above):

  Element Group   eleCnt   grpID grpRnk   Explain grpRnk column...
  <chr>   <dbl>    <int>   <int>  <int>
1 R           1        1       3      1   Ranked 1st since it has lowest eleCnt & lowest grpID
2 X           3        1       3      1   Also ranked 1st since it ties with above
3 R           2        3       7      2   Ranked 2nd since its eleCnt is 2nd and its grpRnk is 2nd

How do I modify distinct() for handling ties, so I can get the desired results for label_7 while keeping the same results for label_18? Maybe there's a better way to do this completely, some function other than distinct() for this sort of thing.

Code:

library(dplyr)

label_7 <- data.frame(Element=c("B","R","R","R","R","B","X","X","X","X","X"),
                      Group = c(0,1,1,2,2,0,3,3,0,0,0),
                      eleCnt = c(1,1,2,3,4,2,1,2,3,4,5),
                      grpID = c(0,3,3,7,7,0,3,3,0,0,0))

label_18 <- data.frame(Element = c("R","R","R","X","X","X","X","B","B","R","R","R","R"),
                       Group = c(3,3,3,4,4,4,4,2,2,1,1,2,2),
                       eleCnt = c(1,2,3,1,2,3,4,1,2,4,5,6,7),
                       grpID = c(6,6,6,10,10,10,10,3,3,9,9,13,13))

label_7 %>% select(Element,Group,eleCnt,grpID) %>% 
  filter(Group > 0) %>% 
  group_by(Element,Group) %>% 
  slice(which.min(Group)) %>% 
  ungroup() %>%
  distinct(eleCnt,grpID, .keep_all = TRUE) %>%
  arrange(eleCnt,grpID) %>%
  mutate(grpRnk = 1:n())

Edit: adding another data frame to test against, label_15 --

> label_15
  Element Group eleCnt grpID
1       B     0      1     0
2       R     1      1     3
3       R     1      2     3
4       R     0      3     0
5       X     2      1     3
6       X     2      2     3
7       X     3      3     7
8       X     3      4     7

Expected results would be similar to label_7, because of a tie between Elements R and X in rows 2 and 5 of the above data frame:

  Element Group eleCnt grpID grpRank
  <chr>   <dbl>  <dbl> <dbl>   <int>
1 R           1      1     3       1
2 X           2      1     3       1
3 X           3      3     7       2

Code for label_15 data frame:

label_15 <- data.frame(Element = c("B","R","R","R","X","X","X","X"),
                       Group = c(0,1,1,0,2,2,3,3),
                       eleCnt = c(1,1,2,3,1,2,3,4),
                       grpID = c(0,3,3,0,3,3,7,7))

CodePudding user response:

We could try

library(dplyr)
library(data.table)
label_7 %>% 
  select(Element,Group,eleCnt,grpID) %>% 
  filter(Group > 0) %>% 
  group_by(Element,Group) %>% 
  slice(which.min(Group)) %>% 
  ungroup() %>% 
  distinct(tmp = rleid(eleCnt, grpID), .keep_all = TRUE) %>%
  arrange(eleCnt,grpID) %>% 
  select(-tmp) %>%
  mutate(grpRank= match(grpID, unique(grpID)))

-output

# A tibble: 3 × 5
  Element Group eleCnt grpID grpRank
  <chr>   <dbl>  <dbl> <dbl>   <int>
1 R           1      1     3       1
2 X           3      1     3       1
3 R           2      3     7       2

For the second case

 label_18 %>%
   select(Element,Group,eleCnt,grpID) %>% 
  filter(Group > 0) %>% 
  group_by(Element,Group) %>% 
  slice(which.min(Group)) %>% 
  ungroup() %>% 
  distinct(tmp = rleid(eleCnt, grpID), .keep_all = TRUE) %>%
  arrange(eleCnt,grpID) %>% 
  select(-tmp) %>%
  mutate(grpRank= match(grpID, unique(grpID)))

-output

# A tibble: 5 × 5
  Element Group eleCnt grpID grpRank
  <chr>   <dbl>  <dbl> <dbl>   <int>
1 B           2      1     3       1
2 R           3      1     6       2
3 X           4      1    10       3
4 R           1      4     9       4
5 R           2      6    13       5
  • Related