Count a character in a specific cell and add that character a specified number of times

I am working with 16S data and trying to format an OTU table so that I can upload it to a different tool.

I need R to count the number of semicolons ";" in each cell of the column "taxonomy" and, if the number is smaller than 6, add the required number of semicolons so that every cell has six. I am new to the bioinformatics field, so any help would be much appreciated!

I tried

ifelse(str_count(ASV$taxonomy, ";") >= 6, ASV$taxonomy, paste0(ASV$taxonomy, " ;"))

but this only ever adds one semicolon. I don't know how to tell R to add as many semicolons as are needed to reach 6 in each cell.

Thank you in advance, Lea

CodePudding user response：

Since I don't have your dataset, I just made an example dataframe.

Next time you ask, please don't post images of code or data. Instead, run dput(your_data) and paste the result into the question.
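As a small illustration of the dput() advice, here is a sketch with a made-up two-row data frame (example_df and its contents are invented for this example, not taken from the question). It shows that the text printed by dput() can be pasted back into R to rebuild the same object:

```r
# Hypothetical example data, standing in for your own table
example_df <- data.frame(
  Name = c("A", "B"),
  OTU = c("A;", "A; B;"),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object; capture it as text
dput_text <- paste(capture.output(dput(example_df)), collapse = "\n")
cat(dput_text, "\n")

# Evaluating that text (what a reader does after copy-pasting) rebuilds the data
rebuilt <- eval(parse(text = dput_text))
identical(rebuilt, example_df)
```

For a large table, something like dput(head(your_data, 20)) is usually enough to make the question reproducible.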

Input

library(tidyverse)

df <- tibble(Name = LETTERS[1:4], 
             OTU = c("A;", "A; B;", "A; B; C; D; E; F;", "A; B; C; D; E;"))


df
# A tibble: 4 x 2
  Name  OTU              
  <chr> <chr>            
1 A     A;               
2 B     A; B;            
3 C     A; B; C; D; E; F;
4 D     A; B; C; D; E;

Code and output

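First count the number of ";" in each value of the OTU column. If there are fewer than 6, append the missing ones (6 minus the count) to the end of the string; otherwise keep the original value. Note that paste() also inserts a space before the appended semicolons.

The result overwrites the OTU column in the returned tibble; assign it back to df if you want to keep the change.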

df %>% mutate(OTU = ifelse(str_count(OTU, ";") < 6, 
                           paste(OTU, str_dup("; ", 6 - str_count(OTU, ";"))), 
                           OTU))

# A tibble: 4 x 2
  Name  OTU                
  <chr> <chr>              
1 A     "A; ; ; ; ; ; "    
2 B     "A; B; ; ; ; ; "   
3 C     "A; B; C; D; E; F;"
4 D     "A; B; C; D; E; ; "
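A possible variation, sketched here in base R: clamp the number of missing semicolons at zero with pmax(), so that no ifelse() is needed. The vector taxonomy and the helper count_semicolons() are made up for this illustration; stringr's str_count() and str_dup() could be used in the same way.

```r
# Example cells, copied from the OTU column above
taxonomy <- c("A;", "A; B;", "A; B; C; D; E; F;", "A; B; C; D; E;")

# Hypothetical helper: count the ";" characters in each string
count_semicolons <- function(x) {
  lengths(regmatches(x, gregexpr(";", x, fixed = TRUE)))
}

# Semicolons still missing per cell; pmax() keeps the value from going below 0
n_missing <- pmax(6 - count_semicolons(taxonomy), 0)

# strrep() is vectorised over `times`; repeating 0 times gives ""
padded <- paste0(taxonomy, strrep(" ;", n_missing))

padded[1]                 # "A; ; ; ; ; ;"
count_semicolons(padded)  # 6 6 6 6
```

Cells that already contain six or more semicolons are returned unchanged, because zero repetitions are appended to them.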

CodePudding user response：

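We could use separate() from the tidyr package with fill = "right", so that short rows are padded with NA, and then paste the pieces back together with the NA padding replaced by "". Replacing the NA values column by column, before pasting, avoids accidentally deleting the letters "NA" from inside a taxon name.

library(tidyverse)

df %>% 
  # split on ";" into 7 pieces; rows with fewer pieces get NA on the right
  separate(col1, c("a","b","c","d","e","f","g"), fill = "right", sep = ";") %>% 
  # turn the NA padding into empty strings before pasting
  mutate(across(a:g, ~ replace_na(.x, ""))) %>% 
  # 7 pieces joined by "; " give exactly 6 semicolons per row
  mutate(col1 = paste(a,b,c,d,e,f,g, sep = "; "), .keep="unused")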

   col1                                                                                                     
   <chr>                                                                                                    
 1 "Bacteria;  Proteobacteria;  Gammaproteobacteria; ; ; ; "                                                
 2 "Bacteria;  Proteobacteria;  Gammaproteobacteria; ; ; ; "                                                
 3 "Bacteria;  Actinobacteria;  Rubrobacteria;  Rubrobacterales;  Rubrobacteraceae;  Rubrobacter; "         
 4 "Bacteria;  Gemmatimonadetes;  Gemm-1; ; ; ; "                                                           
 5 "Bacteria;  Proteobacteria;  Gammaproteobacteria;  Chromatiales; ; ; "                                   
 6 "Bacteria;  Actinobacteria;  Nitriliruptoria;  Nitriliruptorales;  Nitriliruptoraceae; ; "               
 7 "Bacteria;  Proteobacteria;  Gammaproteobacteria;  Chromatiales; ; ; "                                   
 8 "Bacteria;  Actinobacteria;  Thermoleophilia;  Solirubrobacterales; ; ; "                                
 9 "Bacteria;  Proteobacteria;  Gammaproteobacteria;  Chromatiales; ; ; "                                   
10 "Bacteria;  Proteobacteria;  Alphaproteobacteria;  Sphingomonadales;  Sphingomonadaceae;  Kaistobacter; "
11 "Bacteria;  Actinobacteria;  Thermoleophilia;  Solirubrobacterales; ; ; "                                
12 "Bacteria;  Proteobacteria;  Betaproteobacteria;  Burkholderiales;  Oxalobacteraceae;  Ralstonia; "      
13 "Bacteria;  Actinobacteria;  Actinobacteria;  Actinomycetales;  Pseudonocardiaceae; ; "                  
14 "Bacteria;  Actinobacteria;  Actinobacteria;  Actinomycetales;  Micrococcaceae;  Arthrobacter; "

data:

structure(list(col1 = c("Bacteria; Proteobacteria; Gammaproteobacteria;", 
"Bacteria; Proteobacteria; Gammaproteobacteria;", "Bacteria; Actinobacteria; Rubrobacteria; Rubrobacterales; Rubrobacteraceae; Rubrobacter;", 
"Bacteria; Gemmatimonadetes; Gemm-1;", "Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales;", 
"Bacteria; Actinobacteria; Nitriliruptoria; Nitriliruptorales; Nitriliruptoraceae;", 
"Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales;", 
"Bacteria; Actinobacteria; Thermoleophilia; Solirubrobacterales;", 
"Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales;", 
"Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Kaistobacter;", 
"Bacteria; Actinobacteria; Thermoleophilia; Solirubrobacterales;", 
"Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Oxalobacteraceae; Ralstonia;", 
"Bacteria; Actinobacteria; Actinobacteria; Actinomycetales; Pseudonocardiaceae;", 
"Bacteria; Actinobacteria; Actinobacteria; Actinomycetales; Micrococcaceae; Arthrobacter;"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-14L))
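The double spaces in the output above appear because splitting on ";" leaves the leading space on each piece, and paste() then adds another one. If that matters for the target tool, a base-R sketch along these lines trims each piece before padding. The function name pad_taxonomy() and its n_fields argument are invented for this illustration, and it assumes every cell ends with a semicolon, as in the data above.

```r
# Hypothetical helper: pad each taxonomy string to n_fields ";"-terminated fields
pad_taxonomy <- function(x, n_fields = 6) {
  vapply(strsplit(x, ";", fixed = TRUE), function(parts) {
    parts <- trimws(parts)                         # remove the space after each ";"
    length(parts) <- max(length(parts), n_fields)  # extend with NA up to n_fields
    parts[is.na(parts)] <- ""                      # NA padding becomes empty fields
    paste0(parts, ";", collapse = " ")             # terminate every field with ";"
  }, character(1), USE.NAMES = FALSE)
}

# Two rows copied from the data above
x <- c("Bacteria; Proteobacteria; Gammaproteobacteria;",
       "Bacteria; Actinobacteria; Rubrobacteria; Rubrobacterales; Rubrobacteraceae; Rubrobacter;")

out <- pad_taxonomy(x)
out[1]  # "Bacteria; Proteobacteria; Gammaproteobacteria; ; ; ;"
```

Rows that already have six fields come back unchanged, and rows with more than six fields are not truncated.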