I have some data that looks like this:
id | ethnicity |
---|---|
1 | white |
2 | south asian |
2 | other |
3 | other |
4 | white |
4 | south asian |
as seen above there is potential for an id to have two ethnicity values. How would I go about removing these 'other' rows if that id already has an entry such as "white" or "south asian" while keeping the "white" or "south asian" entry?
I have noticed there are entries which also have south asian along with a white entry
My priority would be South Asian > White > Other in terms of keeping rows
So an expected output would be
id | ethnicity |
---|---|
1 | white |
2 | south asian |
3 | other |
4 | south asian |
CodePudding user response:
If the intention is to get the prioritized 'ethnicity' per 'id', convert the column 'ethnicity' to ordered
with levels
specified in the order of preference, then do a group by 'id' and filter
the first
available level in that order
library(dplyr)
df2 %>%
mutate(ethnicity = ordered(ethnicity,
c( "south asian", "white", "other"))) %>%
group_by(id) %>%
filter(ethnicity %in% first(levels(droplevels(ethnicity)))) %>%
ungroup
-output
# A tibble: 4 × 2
id ethnicity
<int> <ord>
1 1 white
2 2 south asian
3 3 other
4 4 south asian
data
df2 <- structure(list(id = c(1L, 2L, 2L, 3L, 4L, 4L), ethnicity = c("white",
"south asian", "other", "other", "white", "south asian")),
class = "data.frame", row.names = c(NA,
-6L))