My df is:
df <- data.frame(region = c(rep("fod", 5), rep("fom", 5), rep("fed", 5)),
                 genus = c('A','A','B','C','C',
                           'B','B','B','A','C',
                           "E","B","B","A","B"),
                 spp = c("A a","A a","B a","C c","C b",
                         "B","B b","B c","A c","C c",
                         "E","B a","B b","A a","B a"))
A quick explanation: each genus is a single name, and each spp has the genus name as its first word and another word (the specific name) as its second. In my data, uppercase letters are the genus names and lowercase letters are the specific names.
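As a small illustration of that naming convention (a sketch in base R; the vector spp_example and the names genus_part and has_specific are made up here and are not part of df), the genus part of each spp can be taken as everything before the first space, and a bare genus is simply a value with no space in it:

```r
# Hypothetical example values, not taken from df
spp_example <- c("A a", "B", "C b")

# Everything before the first space is the genus part;
# a bare genus has no space, so sub() leaves it unchanged
genus_part <- sub(" .*$", "", spp_example)
genus_part
#> [1] "A" "B" "C"

# TRUE where the value also carries a specific name (a second word)
has_specific <- grepl(" ", spp_example, fixed = TRUE)
has_specific
#> [1]  TRUE FALSE  TRUE
```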
I'm trying to do the following: for each region, I'm using n_distinct to count the number of unique genus and spp values. This can be easily done with:
library(dplyr)
df <- df %>% group_by(region) %>%
  summarise(uniq_genus = n_distinct(genus),
            uniq_spp = n_distinct(spp))
However, I have two specific cases:
Example of case 1: region fom, line 6 of df, where the value of genus == spp and B, as a "species genus" (the first word of spp), also occurs in lines 7 and 8. In cases like this, I do not want B to be counted in spp when using n_distinct. The rationale is that if there are already two-word values in spp whose first word is B, then the bare B must be one of them, so it is not unique.
Example of case 2: region fed, line 11 of df, where the value of genus == spp and E, as a "species genus", does not occur anywhere else in fed. In cases like this, I want E to be counted when using n_distinct. The rationale is that if there are no two-word values in spp whose first word is E, then E must be unique.
The second case, I think, is handled simply by using n_distinct. However, I'm having some issues with case 1, that is, when the bare first word (B in the example) would be counted as a unique value even though it also appears in other lines as B followed by a specific name. How can I add these conditions inside an n_distinct(spp...) call?
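To make the target concrete, here is a sketch of what the plain count returns on the data above (the result name naive is only for illustration). If I have worked through it correctly, the only value that should change is uniq_spp for fom, which should go from 5 to 4 once the bare B is excluded:

```r
library(dplyr)

df <- data.frame(region = c(rep("fod", 5), rep("fom", 5), rep("fed", 5)),
                 genus = c("A","A","B","C","C",
                           "B","B","B","A","C",
                           "E","B","B","A","B"),
                 spp = c("A a","A a","B a","C c","C b",
                         "B","B b","B c","A c","C c",
                         "E","B a","B b","A a","B a"))

# Plain count, with no special handling of bare genus names
naive <- df %>%
  group_by(region) %>%
  summarise(uniq_genus = n_distinct(genus),
            uniq_spp = n_distinct(spp))

naive
#>   region uniq_genus uniq_spp
#> 1 fed             3        4   # bare E is kept, as wanted (case 2)
#> 2 fod             3        4
#> 3 fom             3        5   # bare B is counted here, wanted 4 (case 1)
```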
I'm sorry if this is too much info. I think that what I'm trying to do is much simpler than the amount of text that I wrote.
CodePudding user response:
Perhaps this helps: grouped by 'region', set spp to NA where the value of 'spp' is equal to 'genus' and the genus value is flagged by duplicated with fromLast = TRUE (that is, the same genus occurs again later in the group), then use n_distinct with na.rm = TRUE.
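library(dplyr)
df %>%
  group_by(region) %>%
  # keep spp unless it is a bare genus that occurs again later in the group;
  # rows that fail the condition become NA
  mutate(spp = case_when(!(spp == genus &
                           duplicated(genus, fromLast = TRUE)) ~ spp)) %>%
  # na.rm = TRUE drops the NA values from the counts
  summarise(uniq_genus = n_distinct(genus, na.rm = TRUE),
            uniq_spp = n_distinct(spp, na.rm = TRUE))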
Output:
# A tibble: 3 × 3
  region uniq_genus uniq_spp
  <chr>       <int>    <int>
1 fed             3        4
2 fod             3        4
3 fom             3        4
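One caveat with the approach above: duplicated(genus, fromLast = TRUE) only flags a row when the same genus appears again later in the group, so it relies on the bare genus row coming before its two-word rows. If the row order cannot be guaranteed, a possible variant is to drop a bare genus whenever any two-word spp in the same region has that genus. The following is only a sketch under that assumption; count_unique, has_specific, and redundant are names made up for illustration, and it assumes the genus column always matches the first word of spp.

```r
library(dplyr)

df <- data.frame(region = c(rep("fod", 5), rep("fom", 5), rep("fed", 5)),
                 genus = c("A","A","B","C","C",
                           "B","B","B","A","C",
                           "E","B","B","A","B"),
                 spp = c("A a","A a","B a","C c","C b",
                         "B","B b","B c","A c","C c",
                         "E","B a","B b","A a","B a"))

# Hypothetical helper: counts unique genus and spp per region,
# ignoring a bare genus when a two-word spp of that genus exists
count_unique <- function(d) {
  d %>%
    group_by(region) %>%
    mutate(
      # TRUE for two-word values such as "B a"
      has_specific = grepl(" ", spp, fixed = TRUE),
      # bare genus whose genus also has a two-word spp in this region
      redundant = !has_specific & genus %in% genus[has_specific]
    ) %>%
    summarise(uniq_genus = n_distinct(genus),
              uniq_spp = n_distinct(spp[!redundant]))
}

res <- count_unique(df)                    # original row order
res_rev <- count_unique(df[nrow(df):1, ])  # rows reversed
res
res_rev
```

On this data both calls should give 3 and 4 for every region, the same as the output above, regardless of where the bare B sits within fom.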