Ignore some things using str_starts, with n_distinct() in tidyverse

Hello, I am looking to count the number of unique species per sample. However, I don't want to count unknowns unless they are definitely not one of the species present in the sample.

Let's use this example data set:

test <- data.frame(sample = c(rep(1, 4), 
                             rep(2, 4), 
                             rep(3, 4), 
                             rep(4, 4), 
                             rep(5, 1)), 
                   species = c(c("Species_a", "Species_b", "Species_c", "Species_c"), 
                              c("Species_a", "Species_b", "Species_c", "Species_d"),
                              c("Species_a", "Species_b", "Species_b", "Species_b"),
                              c("Species_a", "Species_b", "Species_c", "Unknown_b_or_d"), 
                              c("Unknown_a_or_c")))

I would ideally like to end up with a data frame that looks like this:

  sample nspp
1      1    3
2      2    4
3      3    2
4      4    3
5      5    1

I could of course go the long way around and use filter(), summarize(n_distinct()) and mutate() to do some math, but I was hoping I could cut out a step or two by using str_starts to my advantage here. Is there a way to combine str_starts with n_distinct?

Like so:

test %>%
  group_by(sample) %>%
  summarize(nspp_unknown = n_distinct(str_starts(species, pattern = "Unknown")))

Except... you know, with syntax that works?

Thanks very much for your help!

Shout out to this question, which has almost the same problem (sans the "str_starts" issue) and a very elegant solution. I have shamelessly plagiarized their example: Ignore some things with n_distinct() in dplyr

*** Edit one ***

This question was originally centered on starts_with, which, as was quickly pointed out, only works for selecting columns. So I have edited it to see if str_starts will do the job.
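
To illustrate the difference, here is a minimal sketch of what str_starts() returns and why wrapping it directly in n_distinct() does not count species. The vector `spp` is made up for this illustration and is not part of the example data.

```r
library(dplyr)
library(stringr)

# Hypothetical vector, made up for this illustration
spp <- c("Species_a", "Species_b", "Unknown_b_or_d")

# str_starts() is vectorised over strings: one logical value per element
is_unknown <- str_starts(spp, "Unknown")

# n_distinct() on that logical vector counts distinct TRUE/FALSE values,
# not distinct species
n_flags <- n_distinct(is_unknown)

# Subsetting with the logical vector first counts the known species instead
n_known <- n_distinct(spp[!is_unknown])
```

So str_starts() seems more useful as a filter on the species vector than as the thing being counted.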

*** Edit two ***

For clarification, I am looking to avoid double counting species when I calculate the number of unique species per sample. I have some unknowns that could be one of the species already identified in the sample but could not be identified down to species. I also have some unknowns that, while they could not be identified down to species, cannot be mistaken for any species in the sample.

So in the example below:

   sample        species
1       1      Species_a
2       1      Species_b
3       1      Species_c
4       1      Species_c
5       2      Species_a
6       2      Species_b
7       2      Species_c
8       2      Species_d
9       3      Species_a
10      3      Species_b
11      3      Species_b
12      3      Species_b
13      4      Species_a
14      4      Species_b
15      4      Species_c
16      4 Unknown_b_or_d
17      5 Unknown_a_or_c
18      5      Species_b

I would want to include the unknown in sample 5 since I can be confident I am not double counting, but the unknown in sample 4 must be excluded as it could be a double count of Species_b.

Hope that clears it up!

User response:

A few pre-processing steps are needed to get to your desired result, but I think this works:

library(dplyr)
library(tibble)
library(tidyr)

test %>%
  rowid_to_column() %>%
  separate(species, into = c("species", "type1", "type2"), sep = "_(or_)?", fill = "right") %>%
  pivot_longer(-(rowid:species)) %>%
  filter(!is.na(value)) %>%
  group_by(value, sample) %>%
  mutate(dupe = duplicated(value)) %>%
  group_by(rowid) %>%
  filter(!any(dupe), row_number() == 1) %>%
  ungroup() %>%
  count(sample)

# A tibble: 5 × 2
  sample     n
   <dbl> <int>
1      1     3
2      2     4
3      3     2
4      4     3
5      5     1
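
As a cross-check, the rule from the question can also be sketched more directly with str_starts() inside a single summarise() call. This is only a sketch under stated assumptions: `count_spp` is a hypothetical helper (not part of dplyr or stringr), and it assumes every label follows the "Species_x" or "Unknown_x_or_y" pattern of the example data. Unlike the pipeline above, it only compares unknowns against known species, so two unknowns that overlap with each other would both be counted.

```r
library(dplyr)
library(stringr)

# Example data from the question
test <- data.frame(
  sample  = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), 5),
  species = c("Species_a", "Species_b", "Species_c", "Species_c",
              "Species_a", "Species_b", "Species_c", "Species_d",
              "Species_a", "Species_b", "Species_b", "Species_b",
              "Species_a", "Species_b", "Species_c", "Unknown_b_or_d",
              "Unknown_a_or_c")
)

# Hypothetical helper: count the known species, plus each distinct unknown
# whose candidate letters match none of the known species in the sample
count_spp <- function(x) {
  unknown <- str_starts(x, "Unknown")
  known <- unique(x[!unknown])
  known_ids <- str_remove(known, "^Species_")
  # e.g. "Unknown_b_or_d" becomes c("b", "d")
  candidates <- str_split(str_remove(unique(x[unknown]), "^Unknown_"), "_or_")
  safe <- vapply(candidates, function(ids) !any(ids %in% known_ids), logical(1))
  length(known) + sum(safe)
}

result <- test %>%
  group_by(sample) %>%
  summarise(nspp = count_spp(species))
```

On the example data, `result$nspp` matches the counts in the desired output.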