I am working with a data frame of plant scientific names; a sample is as follows:
plantlist <- data.frame(
  ID = c(1,2,2,2,2,2,2),
  SciName = c("Alkanna tuberculata", "Alkanna tuberculata", "Anchusa tinctoria", "Anchusa tinctoria", "Anchusa tinctoria", "Anchusa tinctoria", "Echium italicum"),
  SciName.w.author = c("Alkanna tuberculata Greuter", "Alkanna tuberculata Meikle", "Anchusa tinctoria L", "Anchusa tinctoria Woodv", "Anchusa tinctoria Pall", "Anchusa tinctoria Meikle", "Echium italicum"),
  Status = c("Unresolved", "Misapplied", "Accepted", "Synonym", "Unresolved", "Synonym", "Misapplied"))
What I need to do is group the rows by ID and SciName and then keep the following rows:
- if there is only one row in the group keep it, no matter what the status is
- if there is more than one row, keep the accepted names and synonyms
- if there are no accepted names or synonyms, keep the unresolved rows, and if there are no unresolved rows, keep the misapplied ones
I tried to accomplish this using case_when and grouping, but I'm stuck on the last part:
library(dplyr)

keep.plantlist <- plantlist %>%
  group_by(ID, SciName) %>%
  mutate(count = n()) %>%
  ungroup() %>%
  # no rule yet for the Unresolved/Misapplied fallback, so those rows get NA
  mutate(keep = case_when(count == 1 ~ TRUE,
                          count > 1 & Status == "Accepted" ~ TRUE,
                          count > 1 & Status == "Synonym" ~ TRUE))
# expected value of keep for each row
plantlist$keep <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
I also tried converting Status to a factor and arranging each group by the priority I need, but I don't know whether there is any function that could make use of that order.
CodePudding user response:
I think this will work, but I would need a better test set to be sure.
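library(dplyr)

keep.plantlist <- plantlist %>%
  group_by(ID, SciName) %>%
  mutate(count = n()) %>%
  # still grouped here, so any() is evaluated within each (ID, SciName) group
  mutate(keep = case_when(
    count == 1 ~ TRUE,
    count > 1 & Status == "Accepted" ~ TRUE,
    count > 1 & Status == "Synonym" ~ TRUE,
    # no Accepted/Synonym in the group: fall back to Unresolved
    !any(Status %in% c("Accepted", "Synonym")) &
      Status %in% "Unresolved" ~ TRUE,
    # no Accepted/Synonym/Unresolved in the group: fall back to Misapplied
    !any(Status %in% c("Accepted", "Synonym", "Unresolved")) &
      Status %in% "Misapplied" ~ TRUE,
    TRUE ~ FALSE
  )) %>%
  ungroup()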
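Since the question mentions ordering the statuses by priority, here is a possible alternative sketch along those lines: map each status to a numeric tier (Accepted and Synonym share the top tier) and keep the rows whose tier equals the best tier in their group. The names status_tier and tier are made up for illustration. The sample below also assumes that rows 1 and 2 share an ID and that the last row has its own ID, which is what the expected keep vector in the question implies. With the IDs exactly as posted, row 2 would be alone in its group and therefore kept.

```r
library(dplyr)

# Sample from the question, minus the author column.
# ASSUMPTION: ID is c(1, 1, 2, 2, 2, 2, 3) rather than the posted
# c(1, 2, 2, 2, 2, 2, 2), so that the expected result can be reproduced.
plantlist <- data.frame(
  ID = c(1, 1, 2, 2, 2, 2, 3),
  SciName = c("Alkanna tuberculata", "Alkanna tuberculata",
              "Anchusa tinctoria", "Anchusa tinctoria",
              "Anchusa tinctoria", "Anchusa tinctoria",
              "Echium italicum"),
  Status = c("Unresolved", "Misapplied", "Accepted", "Synonym",
             "Unresolved", "Synonym", "Misapplied")
)

# Hypothetical lookup: lower number = higher priority.
# Accepted and Synonym are tied on purpose so both are kept together.
status_tier <- c(Accepted = 1, Synonym = 1, Unresolved = 2, Misapplied = 3)

keep.plantlist <- plantlist %>%
  # as.character() guards against Status being a factor;
  # a status missing from the lookup would give NA here
  mutate(tier = unname(status_tier[as.character(Status)])) %>%
  group_by(ID, SciName) %>%
  # a row is kept when it sits in the best tier present in its group;
  # a single-row group always satisfies this
  mutate(keep = tier == min(tier)) %>%
  ungroup() %>%
  select(-tier)
```

Under that ID assumption, keep.plantlist$keep comes out as TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, matching the expected column in the question. Adding another status level only requires a new entry in the lookup, not another case_when branch.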