Hello, I am looking to count the number of unique species per sample. However, I don't want to count unknowns unless they are definitely not one of the species already present in the sample.
Let's use this example data set:
test <- data.frame(sample = c(rep(1, 4),
                              rep(2, 4),
                              rep(3, 4),
                              rep(4, 4),
                              rep(5, 1)),
                   species = c(c("Species_a", "Species_b", "Species_c", "Species_c"),
                               c("Species_a", "Species_b", "Species_c", "Species_d"),
                               c("Species_a", "Species_b", "Species_b", "Species_b"),
                               c("Species_a", "Species_b", "Species_c", "Unknown_b_or_d"),
                               c("Unknown_a_or_c")))
I would ideally like to end up with a data frame that looks like this:
  sample nspp
1      1    3
2      2    4
3      3    2
4      4    3
5      5    1
I could of course go the long way around and use the filter function, summarize(n_distinct) and mutate to do some math, but I was hoping I could cut out a step or two by using starts_with to my advantage here. Is there a way to combine str_starts with n_distinct?
Like so:
test %>%
  group_by(sample) %>%
  summarize(nspp_unknown = n_distinct(str_starts(species, pattern = "Unknown")))
Except... you know with syntax that works?
Thanks very much for your help!
Shout out to this question, which has almost the same problem (sans the "str_starts" issue) and a very elegant solution. I have shamelessly borrowed their example: Ignore some things with n_distinct() in dplyr
*** Edit one ***
This question was originally centered on starts_with, which, as was quickly pointed out, only works for selecting columns. So I have edited it to see whether str_starts will do the job.
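As far as I understand it, the difference looks roughly like this. This is only a minimal sketch: the data frame `df` and its `Unknown_flag` column are made up for illustration and are not part of my real data.

```r
library(dplyr)
library(stringr)

# Made-up data: one column whose NAME starts with "Unknown",
# and one column whose VALUES may start with "Unknown"
df <- data.frame(Unknown_flag = c(TRUE, FALSE),
                 species = c("Unknown_a_or_c", "Species_b"))

# starts_with() is a selection helper: it picks columns by name
cols <- names(select(df, starts_with("Unknown")))

# str_starts() works on a character vector: it tests each value
is_unknown <- str_starts(df$species, "Unknown")
```

So `cols` holds column names, while `is_unknown` is a logical vector with one entry per row, which is the kind of thing that can be used inside summarize.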
*** Edit two ***
For clarification, I am looking to avoid double counting species when I calculate the number of unique species per sample. Some of my unknowns could be one of the species already identified in the sample and simply could not be identified down to species. Other unknowns, while they also could not be identified down to species, cannot be mistaken for any species in the sample.
So in the example below:
   sample        species
1       1      Species_a
2       1      Species_b
3       1      Species_c
4       1      Species_c
5       2      Species_a
6       2      Species_b
7       2      Species_c
8       2      Species_d
9       3      Species_a
10      3      Species_b
11      3      Species_b
12      3      Species_b
13      4      Species_a
14      4      Species_b
15      4      Species_c
16      4 Unknown_b_or_d
17      5 Unknown_a_or_c
18      5      Species_b
I would want to include the unknown in sample 5, since I can be confident I am not double counting, but the unknown in sample 4 must be excluded, as it could be a double count of Species_b.
Hope that clears it up!
Answer:
A few pre-processing steps are needed to get to your desired result but I think this works:
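library(dplyr)
library(tibble)
library(tidyr)

test %>%
  # give each observation an id so it can be regrouped after pivoting
  rowid_to_column() %>%
  # "Species_a" -> type1 = "a"; "Unknown_b_or_d" -> type1 = "b", type2 = "d"
  separate(species, into = c("species", "type1", "type2"), sep = "_(or_)?", fill = "right") %>%
  # one row per candidate species code, dropping the empty type2 slots
  pivot_longer(-(rowid:species)) %>%
  filter(!is.na(value)) %>%
  # flag codes that have already appeared in the same sample
  group_by(value, sample) %>%
  mutate(dupe = duplicated(value)) %>%
  # drop observations with any repeated code, then keep one row per observation
  group_by(rowid) %>%
  filter(!any(dupe), row_number() == 1) %>%
  ungroup() %>%
  count(sample)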
# A tibble: 5 × 2
  sample     n
   <dbl> <int>
1      1     3
2      2     4
3      3     2
4      4     3
5      5     1
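If you would rather keep `str_starts` in the picture, here is a rough alternative sketch rather than a tested general solution. `count_species` is a hypothetical helper name, and the sketch assumes every label looks like either `Species_x` or `Unknown_x_or_y`. It also assumes each distinct "safe" unknown label is its own species, so two different unknowns that share a candidate would both be counted, which differs from the pipeline above.

```r
library(dplyr)
library(stringr)

test <- data.frame(
  sample  = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 1)),
  species = c("Species_a", "Species_b", "Species_c", "Species_c",
              "Species_a", "Species_b", "Species_c", "Species_d",
              "Species_a", "Species_b", "Species_b", "Species_b",
              "Species_a", "Species_b", "Species_c", "Unknown_b_or_d",
              "Unknown_a_or_c")
)

# Hypothetical helper: distinct species count for the labels of ONE sample
count_species <- function(species) {
  is_unknown <- str_starts(species, "Unknown")

  # identified species and their short codes ("a", "b", ...)
  known <- unique(species[!is_unknown])
  known_codes <- str_remove(known, "^Species_")

  # candidate codes for each distinct unknown, e.g. "b_or_d" -> c("b", "d")
  unknowns <- unique(species[is_unknown])
  candidates <- str_split(str_remove(unknowns, "^Unknown_"), "_or_")

  # an unknown is "safe" only if none of its candidates was identified
  safe <- vapply(candidates, function(x) !any(x %in% known_codes), logical(1))

  length(known) + sum(safe)
}

result <- test %>%
  group_by(sample) %>%
  summarize(nspp = count_species(species))
```

On the example data, `result$nspp` comes out as 3, 4, 2, 3, 1, matching the table in the question. Whether this is simpler than the reshaping approach is a matter of taste; it mainly avoids the pivot at the cost of a small helper function.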