Summarizing distinct character vectors based on strings in another column

My df is

df <- data.frame(region = c(rep("fod", 5), rep("fom", 5), rep("fed", 5)),
                 genus = c('A','A','B','C','C',
                           'B','B','B','A','C',
                           "E","B","B","A","B"),
                 spp = c("A a","A a","B a","C c","C b",
                         "B","B b","B c","A c","C c",
                         "E","B a","B b","A a","B a"))

A quick explanation: each genus is a single name, and each spp entry has the genus name as its first word and another word (the specific name) as its second. In my case, uppercase letters are the genera and lowercase letters are the specific names.
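A minimal sketch to check that structure, using the same data as above. The names first_word and bare are only illustrative, and stringsAsFactors = FALSE is set explicitly so the comparison also works on R versions before 4.0:

```r
df <- data.frame(region = c(rep("fod", 5), rep("fom", 5), rep("fed", 5)),
                 genus = c('A','A','B','C','C',
                           'B','B','B','A','C',
                           "E","B","B","A","B"),
                 spp = c("A a","A a","B a","C c","C b",
                         "B","B b","B c","A c","C c",
                         "E","B a","B b","A a","B a"),
                 stringsAsFactors = FALSE)

# First word of every spp entry (everything before the first space)
first_word <- sub(" .*$", "", df$spp)

# Every spp entry starts with its genus name
all(first_word == df$genus)   # TRUE

# Rows where spp is only the bare genus name, with no specific name
bare <- df$spp == df$genus
which(bare)                   # 6 11
```

Rows 6 and 11 are the two entries without a specific name, which are the two special cases described below.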

I'm trying to do the following: for each region, I'm using n_distinct to count the number of unique genus and spp values. This is easily done with

library(dplyr)

# Printed rather than assigned back to df, so the original data stays available
df %>% group_by(region) %>% 
      summarise(uniq_genus = n_distinct(genus),
                uniq_spp = n_distinct(spp))

However, I have two specific cases:

Example of case 1: region fom, row 6 of df, where genus == spp and B, as a "species genus" (the first word of spp), also occurs in rows 7 and 8. In cases like this, I do not want B to be counted in spp when using n_distinct. The rationale is that if the region already has two-word spp entries whose first word is B, then the bare B must be one of them, so it is not unique.

Example of case 2: region fed, row 11 of df, where genus == spp and E, as a "species genus", does not occur anywhere else in fed. In cases like this, I want E to be counted when using n_distinct. The rationale is that if the region has no two-word spp entries whose first word is E, then E must be unique.

The second case, I think, can be solved simply by using n_distinct. However, I'm having trouble with case 1, where the bare first word (B in the example) would be counted as a unique value even though it also appears in other rows as B plus a specific name. How can I add these conditions inside a n_distinct(spp...)?
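To make case 1 concrete, here is a small sketch of what the plain n_distinct call returns on this data. The name naive is only illustrative:

```r
library(dplyr)

df <- data.frame(region = c(rep("fod", 5), rep("fom", 5), rep("fed", 5)),
                 genus = c('A','A','B','C','C',
                           'B','B','B','A','C',
                           "E","B","B","A","B"),
                 spp = c("A a","A a","B a","C c","C b",
                         "B","B b","B c","A c","C c",
                         "E","B a","B b","A a","B a"),
                 stringsAsFactors = FALSE)

# Plain count: every distinct string in spp is counted, including bare genus names
naive <- df %>%
  group_by(region) %>%
  summarise(uniq_genus = n_distinct(genus),
            uniq_spp = n_distinct(spp))

naive
# fom gets uniq_spp = 5 (B, B b, B c, A c, C c), but the wanted value is 4
# fed gets uniq_spp = 4 (E, B a, B b, A a), which is already the wanted value
```

So only fom is overcounted: the bare B is treated as a fifth species next to B b and B c.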

I'm sorry if this is too much info. I think that what I'm trying to do is much simpler than the amount of text I wrote suggests.

CodePudding user response：

Perhaps this helps: grouped by 'region', set 'spp' to NA where 'spp' is equal to 'genus' and the same genus occurs again later in the group (duplicated(genus, fromLast = TRUE)), then use n_distinct with na.rm = TRUE so that the NA is not counted

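library(dplyr)
df %>% 
  group_by(region) %>%
  # Keep spp unless it is a bare genus name whose genus appears again
  # further down in the same region; case_when returns NA for those rows
  mutate(spp = case_when(!(spp == genus & 
         duplicated(genus, fromLast = TRUE)) ~ spp)) %>%    
  # na.rm = TRUE keeps the NA rows out of the distinct counts
  summarise(uniq_genus = n_distinct(genus, na.rm = TRUE),
            uniq_spp = n_distinct(spp, na.rm = TRUE))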

-output

# A tibble: 3 × 3
  region uniq_genus uniq_spp
  <chr>       <int>    <int>
1 fed             3        4
2 fod             3        4
3 fom             3        4
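
One caveat, offered as a sketch rather than a correction: duplicated(genus, fromLast = TRUE) only flags a row when the same genus appears again further down in the group, so the result above relies on the bare genus row coming before its two-word entries, as it does in this data. If the row order cannot be guaranteed, a variant that checks whether the genus has any two-word entry in the region might look like this. The helper name count_taxa and the column has_binomial are made up for illustration:

```r
library(dplyr)

df <- data.frame(region = c(rep("fod", 5), rep("fom", 5), rep("fed", 5)),
                 genus = c('A','A','B','C','C',
                           'B','B','B','A','C',
                           "E","B","B","A","B"),
                 spp = c("A a","A a","B a","C c","C b",
                         "B","B b","B c","A c","C c",
                         "E","B a","B b","A a","B a"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: count genera and species per region, ignoring a bare
# genus name whenever that genus also has a two-word entry in the same region
count_taxa <- function(data) {
  data %>%
    group_by(region) %>%
    mutate(
      # TRUE if this row's genus has at least one entry with a specific name
      has_binomial = genus %in% genus[spp != genus],
      # Blank out the bare genus name only in that situation
      spp = if_else(spp == genus & has_binomial, NA_character_, spp)
    ) %>%
    summarise(uniq_genus = n_distinct(genus),
              uniq_spp = n_distinct(spp, na.rm = TRUE))
}

res <- count_taxa(df)                    # original row order
res_reversed <- count_taxa(df[nrow(df):1, ])  # rows reversed

res
res_reversed
```

Both calls give uniq_genus = 3 and uniq_spp = 4 for every region, because the check no longer depends on where the bare genus row sits within its group.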