I am working with microbial data where every taxa level (i.e. Kingdom, Phylum, Class, Order, Family, Genus and Species) are separated by ;
.
Example input data:
input_data <- data.frame(taxon = c("k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__crispatus", "k__Bacteria;p__Firmicutes;c__Tissierellia;o__Tissierellales;f__Peptoniphilaceae;g__Anaerococcus;s__", "k__Bacteria;p__Proteobacteria;c__Betap__Proteobacteria;o__Burkholderiales;f__Comamonadaceae", "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae", "k__Bacteria;p__Proteobacteria;c__Alphap__Proteobacteria;o__Rhizobiales;f__Bradyrhizobiaceae;g__Bosea;s__massiliensis", "k__Bacteria;p__Proteobacteria;c__Gammap__Proteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__baumannii", "k__Bacteria;p__Proteobacteria;c__Betap__Proteobacteria;o__Nitrosomonadales;f__Methylophilaceae"))
print(input_data)
The input data will look like this:
taxon
1 k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__crispatus
2 k__Bacteria;p__Firmicutes;c__Tissierellia;o__Tissierellales;f__Peptoniphilaceae;g__Anaerococcus;s__
3 k__Bacteria;p__Proteobacteria;c__Betap__Proteobacteria;o__Burkholderiales;f__Comamonadaceae
4 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae
5 k__Bacteria;p__Proteobacteria;c__Alphap__Proteobacteria;o__Rhizobiales;f__Bradyrhizobiaceae;g__Bosea;s__massiliensis
6 k__Bacteria;p__Proteobacteria;c__Gammap__Proteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__baumannii
7 k__Bacteria;p__Proteobacteria;c__Betap__Proteobacteria;o__Nitrosomonadales;f__Methylophilaceae
I would like to create a new column taxon_main
that is based on the highest available taxa level that looks like the example output data below.
Example output data:
output_data <- data.frame(
taxon = c("k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__crispatus",
"k__Bacteria;p__Firmicutes;c__Tissierellia;o__Tissierellales;f__Peptoniphilaceae;g__Anaerococcus;s__",
"k__Bacteria;p__Proteobacteria;c__Betap__Proteobacteria;o__Burkholderiales;f__Comamonadaceae", "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae", "k__Bacteria;p__Proteobacteria;c__Alphap__Proteobacteria;o__Rhizobiales;f__Bradyrhizobiaceae;g__Bosea;s__massiliensis",
"k__Bacteria;p__Proteobacteria;c__Gammap__Proteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__baumannii",
"k__Bacteria;p__Proteobacteria;c__Betap__Proteobacteria;o__Nitrosomonadales;f__Methylophilaceae"),
taxon_main = c("Lactobacillus_crispatus", "g_Anaerococcus", "f_Comamonadaceae",
"f_Lachnospiraceae", "Bosea_massiliensis", "Acinetobacter_baumannii",
"f_Methylophilaceae"))
print(output_data)
The output data will have a new column taxon_main
that should look like this:
taxon_main
1 Lactobacillus_crispatus
2 g_Anaerococcus
3 f_Comamonadaceae
4 f_Lachnospiraceae
5 Bosea_massiliensis
6 Acinetobacter_baumannii
7 f_Methylophilaceae
Since I cannot just create the new column based on extracting the taxa after the last separator using dplyr::mutate(taxon_main = str_extract(taxon, "[^;] $"))
because I need to new column to be based on following criteria:
- If genus
g__*
and speciess__*
are available, then concatenate to becomegenus_species
. - If genus and/or species not available, then use next highest level - e.g.
o_Order
. - Only keep a single underscore
_
as a separator for the new column rather than double underscores__
.
The functions I have tried using are separate(taxon, sep = ";")
, mutate(Species = na_if(Species, "s__")
and paste(input_data$Genus, input_data$Species, sep = "_")
. However, I can't seem to make this applicable to all conditions required to get the desired output.
If anyone has any ideas of how best to wrangle the data to become what I am hoping to do, that would be great. Thank you!
CodePudding user response:
There is probably a more efficient way to do this, but we could do a series of ifelse
statements using case_when
from tidyverse
. First, I remove any rows that just end with ;s__
. Then, in the series of statements, I check to if a given taxonomic level is present, then if so, then return that in the desired format. Then, that is repeated across all taxonomic levels.
library(tidyverse)
output <- input_data %>%
mutate(taxon = trimws(taxon, whitespace = ";s__")) %>%
mutate(taxon_main = case_when(str_detect(taxon, "s__") ~ trimws(str_replace_all(str_extract(taxon, "(?<=g__).*"), ";s_", ""), whitespace = '_'),
!str_detect(taxon, "s__") & str_detect(taxon, "g__")~ str_replace_all(str_extract(taxon, "g__.*"), "__", "_"),
!str_detect(taxon, "g__") & str_detect(taxon, "f__") ~ str_replace_all(str_extract(taxon, "f__.*"), "__", "_"),
!str_detect(taxon, "f__") & str_detect(taxon, "o__")~ str_replace_all(str_extract(taxon, "o__.*"), "__", "_"),
!str_detect(taxon, "o__") & str_detect(taxon, "c__")~ str_replace_all(str_extract(taxon, "c__.*"), "__", "_"),
!str_detect(taxon, "c__") & str_detect(taxon, "p__")~ str_replace_all(str_extract(taxon, "p__.*"), "__", "_"),
!str_detect(taxon, "p__") & str_detect(taxon, "k__")~ str_replace_all(str_extract(taxon, "k__.*"), "__", "_"),
TRUE ~ NA_character_))
Output
output %>% select(taxon_main)
taxon_main
1 Lactobacillus_crispatus
2 g_Anaerococcus
3 f_Comamonadaceae
4 f_Lachnospiraceae
5 Bosea_massiliensis
6 Acinetobacter_baumannii
7 f_Methylophilaceae
Or you could also use separate
first, which will make the code less reliant on using a lot of stringr
. We can clean up before using separate
, such as only having one underscore and remove extra s__
. Then, we can go through the ifelse
statements, and then we can bind back to the original taxon
column and drop all the other columns, except for taxon_main
.
input_data %>%
mutate(taxon = trimws(taxon, whitespace = ";s__"),
taxon = str_replace_all(taxon, ";s__", ";"),
taxon = str_replace_all(taxon, "__", "_")) %>%
separate(taxon, sep = ";", into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")) %>%
mutate(taxon_main = case_when(!is.na(Species) ~ paste(str_extract(Genus, "(?<=g_).*"), Species, sep = "_"),
is.na(Species) & !is.na(Genus) ~ Genus,
is.na(Genus) & !is.na(Family) ~ Family,
is.na(Family) & !is.na(Order) ~ Order,
is.na(Order) & !is.na(Class) ~ Class,
is.na(Class) & !is.na(Phylum) ~ Phylum,
is.na(Phylum) & !is.na(Kingdom) ~ Kingdom
)) %>%
bind_cols(input_data,.) %>%
select(taxon_main, taxon)
Output
taxon_main taxon
1 Lactobacillus_crispatus k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__crispatus
2 g_Anaerococcus k__Bacteria;p__Firmicutes;c__Tissierellia;o__Tissierellales;f__Peptoniphilaceae;g__Anaerococcus;s__
3 f_Comamonadaceae k__Bacteria;p__Proteobacteria;c__Betap__Proteobacteria;o__Burkholderiales;f__Comamonadaceae
4 f_Lachnospiraceae k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae
5 Bosea_massiliensis k__Bacteria;p__Proteobacteria;c__Alphap__Proteobacteria;o__Rhizobiales;f__Bradyrhizobiaceae;g__Bosea;s__massiliensis
6 Acinetobacter_baumannii k__Bacteria;p__Proteobacteria;c__Gammap__Proteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__baumannii
7 f_Methylophilaceae k__Bacteria;p__Proteobacteria;c__Betap__Proteobacteria;o__Nitrosomonadales;f__Methylophilaceae