(R) Filter rows based on string names if the only resulting match in another column is NA-CodePudding

The title may sound kinda weird but I have found no way of better defining my issue.

Here an example data set:

test = data.frame(genus = c("Acicarpha", "Acicarpha", "Acicarpha", "Acicarpha", "Acisanthera", "Acisanthera", "Acisanthera", "Acisanthera", "Acmella", "Acmella"), sp1 = c("NA", "bonariensis", "bonariensis", "spathulata", NA, "variabilis", "variabilis", "variabilis", NA, NA))

As you can see, I have a few species names (genus sp1) possible: Acicarpha NA, Acicarpha bonariensis, Acicarpha spathulata, Acisanthera variabilis, Acisanthera NA, and Acmella NA.

Here's the deal: I'm trying to select only the row related to Acmella NA since the only returning value on the sp1 column is NA. Other species also have NA, but they do not have only NA. How can I do this? I'm bashing my head.

CodePudding user response：

Here's some code that does what I think you're asking for. It has four steps:

Group the rows by genus.
Make a new column called all_sp1_na that is TRUE if all of each genus's sp1 observations are NA, FALSE otherwise (i.e. FALSE if at least one sp1 observation is not NA for that genus).
Filter for rows where all_sp1_na is true.
Remove the temporary column all_sp1_na.

library(tidyverse)

test %>%
  group_by(genus) %>%
  mutate(all_sp1_na = all(is.na(sp1))) %>%
  filter(all_sp1_na) %>%
  select(-all_sp1_na)

And it gives this result:

# A tibble: 2 x 2
# Groups:   genus [1]
  genus   sp1  
  <chr>   <chr>
1 Acmella NA   
2 Acmella NA

Let me know if you're looking for something else.