How to a row in a dataframe based on certain conditions-CodePudding

I have some data that looks like this:

id	ethnicity
1	white
2	south asian
2	other
3	other
4	white
4	south asian

as seen above there is potential for an id to have two ethnicity values. How would I go about removing these 'other' rows if that id already has an entry such as "white" or "south asian" while keeping the "white" or "south asian" entry?

I have noticed there are entries which also have south asian along with a white entry

My priority would be South Asian > White > Other in terms of keeping rows

So an expected output would be

id	ethnicity
1	white
2	south asian
3	other
4	south asian

CodePudding user response：

If the intention is to get the prioritized 'ethnicity' per 'id', convert the column 'ethnicity' to ordered with levels specified in the order of preference, then do a group by 'id' and filter the first available level in that order

library(dplyr)
df2 %>%
  mutate(ethnicity = ordered(ethnicity, 
    c( "south asian", "white", "other"))) %>%
  group_by(id) %>% 
  filter(ethnicity %in% first(levels(droplevels(ethnicity)))) %>%
  ungroup

-output

# A tibble: 4 × 2
     id ethnicity  
  <int> <ord>      
1     1 white      
2     2 south asian
3     3 other      
4     4 south asian

data

df2 <- structure(list(id = c(1L, 2L, 2L, 3L, 4L, 4L), ethnicity = c("white", 
"south asian", "other", "other", "white", "south asian")),
 class = "data.frame", row.names = c(NA, 
-6L))