I am working on 16S data and am trying to format an OTU table so that I can upload it to a different tool.
The taxonomy column does not yet have the layout the tool expects: every cell is supposed to contain six semicolons.
So I need R to count the number of semicolons (";") in each cell of the "taxonomy" column and, if there are fewer than six, append as many semicolons as needed so that every cell contains six. I am new to bioinformatics, so any help would be much appreciated!
I tried
ifelse(str_count(ASV$taxonomy, ";") >= 6, ASV$taxonomy, paste0(ASV$taxonomy, " ;"))
but this only ever adds one semicolon, and I don't know how to tell R to add as many semicolons as are needed to reach six in each cell.
Thank you in advance, Lea
CodePudding user response:
Since I don't have your dataset, I just made an example dataframe.
Next time you ask, please don't post images of code or data. Instead, run dput(your_data) and paste the result into the question.
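As a rough illustration of the dput() advice, the sketch below builds a small made-up stand-in for the asker's table (the object name ASV comes from the question, but the OTU_ID column and the values are assumptions) and prints a copy-pasteable version of it.

```r
# Made-up stand-in for the asker's table; the OTU_ID column and all values
# are assumptions for illustration only.
ASV <- data.frame(
  OTU_ID = c("OTU_1", "OTU_2"),
  taxonomy = c("Bacteria; Proteobacteria;",
               "Bacteria; Actinobacteria; Rubrobacteria;"),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object;
# head() keeps the pasted text short for large tables.
dput(head(ASV, 10))
```

Anyone reading the question can then assign the pasted structure(...) text to a name and work with the same data.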
Input
library(tidyverse)
df <- tibble(Name = LETTERS[1:4],
OTU = c("A;", "A; B;", "A; B; C; D; E; F;", "A; B; C; D; E;"))
df
# A tibble: 4 x 2
  Name  OTU
  <chr> <chr>
1 A     A;
2 B     A; B;
3 C     A; B; C; D; E; F;
4 D     A; B; C; D; E;
Code and output
First count the number of ";" in the column. If it is fewer than 6, append the missing ones to the end of the string in OTU. If it is not fewer than 6, keep the original OTU value.
This will entirely replace the original OTU column.
df %>% mutate(OTU = ifelse(str_count(OTU, ";") < 6,
paste(OTU, str_dup("; ", 6 - str_count(OTU, ";"))),
OTU))
# A tibble: 4 x 2
  Name  OTU
  <chr> <chr>
1 A     "A; ; ; ; ; ; "
2 B     "A; B; ; ; ; ; "
3 C     "A; B; C; D; E; F;"
4 D     "A; B; C; D; E; ; "
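The same count-then-pad idea can also be sketched in base R without ifelse(), by clamping the number of missing semicolons at zero. The helper name pad_semicolons and its arguments are hypothetical, not part of the answer above, and the padding string is a parameter because the exact spacing the target tool expects is not known here.

```r
# Hypothetical helper (not from the answer above): pad each string so that
# it contains at least `n` semicolons.
pad_semicolons <- function(x, n = 6, pad = ";") {
  # Count the literal semicolons in each element
  n_found <- lengths(regmatches(x, gregexpr(";", x, fixed = TRUE)))
  # Number of semicolons still missing, never below zero
  n_missing <- pmax(n - n_found, 0)
  # strrep() is vectorised over `times`; zero repeats give ""
  paste0(x, strrep(pad, n_missing))
}

otu <- c("A;", "A; B;", "A; B; C; D; E; F;", "A; B; C; D; E;")
padded <- pad_semicolons(otu, pad = " ;")
print(padded)
# "A; ; ; ; ; ;"  "A; B; ; ; ; ;"  "A; B; C; D; E; F;"  "A; B; C; D; E; ;"
```

Strings that already have six or more semicolons are returned unchanged, so no separate branch is needed for them.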
CodePudding user response:
We could use separate() with the fill argument from the tidyr package, then paste the pieces back together, and finally replace NA with "".
library(tidyverse)
df %>%
separate(col1, c("a","b","c","d","e","f","g"), fill = "right", sep = ";") %>%
mutate(col1 = paste(a,b,c,d,e,f,g, sep = "; "), .keep="unused") %>%
mutate(col1 = str_replace_all(col1, "NA", ""))
col1
<chr>
1 "Bacteria; Proteobacteria; Gammaproteobacteria; ; ; ; "
2 "Bacteria; Proteobacteria; Gammaproteobacteria; ; ; ; "
3 "Bacteria; Actinobacteria; Rubrobacteria; Rubrobacterales; Rubrobacteraceae; Rubrobacter; "
4 "Bacteria; Gemmatimonadetes; Gemm-1; ; ; ; "
5 "Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales; ; ; "
6 "Bacteria; Actinobacteria; Nitriliruptoria; Nitriliruptorales; Nitriliruptoraceae; ; "
7 "Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales; ; ; "
8 "Bacteria; Actinobacteria; Thermoleophilia; Solirubrobacterales; ; ; "
9 "Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales; ; ; "
10 "Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Kaistobacter; "
11 "Bacteria; Actinobacteria; Thermoleophilia; Solirubrobacterales; ; ; "
12 "Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Oxalobacteraceae; Ralstonia; "
13 "Bacteria; Actinobacteria; Actinobacteria; Actinomycetales; Pseudonocardiaceae; ; "
14 "Bacteria; Actinobacteria; Actinobacteria; Actinomycetales; Micrococcaceae; Arthrobacter; "
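One thing to watch with this approach is that replacing the literal text "NA" would also change any taxon name that happens to contain those two letters. A possible base R variant of the same split, fill and paste idea is sketched below; the helper name pad_fields is hypothetical, and the second input string is made up purely to show the edge case.

```r
# Hypothetical helper: split on ";", fill up to `n + 1` fields with "",
# then join the fields again with "; ".
pad_fields <- function(x, n = 6) {
  parts <- strsplit(x, ";", fixed = TRUE)
  vapply(parts, function(p) {
    p <- trimws(p)                      # drop the space that followed each ";"
    length(p) <- max(length(p), n + 1)  # extending a vector pads it with NA
    p[is.na(p)] <- ""                   # only the padding is blanked
    paste(p, collapse = "; ")
  }, character(1))
}

tax <- c("Bacteria; Proteobacteria; Gammaproteobacteria;",
         "Bacteria; NAtaxon;")  # made-up name containing "NA"
res <- pad_fields(tax)
print(res)
# "Bacteria; Proteobacteria; Gammaproteobacteria; ; ; ; "
# "Bacteria; NAtaxon; ; ; ; ; "
```

Because the empty fields are filled before pasting, no text replacement on the final string is needed.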
data:
structure(list(col1 = c("Bacteria; Proteobacteria; Gammaproteobacteria;",
"Bacteria; Proteobacteria; Gammaproteobacteria;", "Bacteria; Actinobacteria; Rubrobacteria; Rubrobacterales; Rubrobacteraceae; Rubrobacter;",
"Bacteria; Gemmatimonadetes; Gemm-1;", "Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales;",
"Bacteria; Actinobacteria; Nitriliruptoria; Nitriliruptorales; Nitriliruptoraceae;",
"Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales;",
"Bacteria; Actinobacteria; Thermoleophilia; Solirubrobacterales;",
"Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales;",
"Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Kaistobacter;",
"Bacteria; Actinobacteria; Thermoleophilia; Solirubrobacterales;",
"Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Oxalobacteraceae; Ralstonia;",
"Bacteria; Actinobacteria; Actinobacteria; Actinomycetales; Pseudonocardiaceae;",
"Bacteria; Actinobacteria; Actinobacteria; Actinomycetales; Micrococcaceae; Arthrobacter;"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-14L))