Dplyr when two columns match choose largest value in third columns with custom order-CodePudding

I have a df that looks like the following:

ID   CAT   VAL
2     A    I
2     A    II
3     B    I
3     B    I
3     A   Unknown
4     A   III

What I would like to do is if the ID and the CAT match take only the row with the highest VAL using a custom order (Unknown, I,II,III,IV). So the output would look like:

ID   CAT   VAL
2     A    II
3     B    I
3     A   Unknown
4     A   III

CodePudding user response：

Create an ordered factor and then you can take max for each group.

library(dplyr)

custom_order <- c('Unknown','I','II','III','IV')

df %>%
  mutate(VAL = factor(VAL, custom_order, ordered = TRUE)) %>%
  group_by(ID, CAT) %>%
  summarise(VAL = max(VAL)) %>%
  ungroup

#     ID CAT   VAL    
#  <int> <chr> <ord>  
#1     2 A     II     
#2     3 A     Unknown
#3     3 B     I      
#4     4 A     III

Another option using arrange and distinct.

df %>%
  arrange(ID, CAT, desc(match(VAL, custom_order))) %>%
  distinct(ID, CAT, .keep_all = TRUE)

CodePudding user response：

We may use

library(dplyr)
df1 %>%
    arrange(ID, CAT, match(VAL, c('Unknown','I','II','III','IV'))) %>% 
    group_by(ID, CAT) %>% 
    slice_tail(n = 1)
# A tibble: 4 × 3
# Groups:   ID, CAT [4]
     ID CAT   VAL    
  <int> <chr> <chr>  
1     2 A     II     
2     3 A     Unknown
3     3 B     I      
4     4 A     III

data

df1 <- structure(list(ID = c(2L, 2L, 3L, 3L, 3L, 4L), CAT = c("A", "A", 
"B", "B", "A", "A"), VAL = c("I", "II", "I", "I", "Unknown", 
"III")), class = "data.frame", row.names = c(NA, -6L))