Finding and substituting set of codes/words in a file based on a list of old and corrected ones

Time:03-19

I have a FASTA_16S.txt file containing paragraphs of different lengths with a unique code (e.g. 16S317) at the top. After transfer into R, I have a list with 413 members that looks like this:

[1]">16S317_V._rotiferianus_A\n
AAATTGAAGAGTTTGATCATGGCTCAG..."
[2]">16S318_Salmonella_bongori\n
AAATTGAAGAGTTTGATCATGGCTCAGATT..."
[3]">16S319_Escherichia_coli\n
TTGAAGAGTTTGATCATGGCTCAGATTG...

I need to substitute the existing codes with the new ones from a table Code_16S:

     Old    New
 1. 16S317 16S001
 2. 16S318 16S307 
 3. 16S319 16S211
 4.  ...    ...

Can anybody suggest code that identifies an old code and substitutes the new one for it? Note that some codes appear in both the Old and New columns, so applying gsub or replace directly to the whole list did not work: after one substitution, two paragraphs carry the same code, and a later step then changes both of them.

Below is my solution to the problem, but I don't consider it optimal.

CodePudding user response:

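Instead of using lapply, it may be easier to use str_replace_all with a named vector (names are the old codes, values are the new ones), which deframe builds from the two-column table: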

library(stringr)
library(tibble)
FASTA_16S <- str_replace_all(FASTA_16S, deframe(Code_16S))

-output

FASTA_16S
[1] ">16S001_V._rotiferianus_A\n\nAAATTGAAGAGTTTGATCATGGCTCAG..."   
[2] ">16S307_Salmonella_bongori\n\nAAATTGAAGAGTTTGATCATGGCTCAGATT..."

data

FASTA_16S <- c(">16S317_V._rotiferianus_A\n\nAAATTGAAGAGTTTGATCATGGCTCAG...", 
">16S318_Salmonella_bongori\n\nAAATTGAAGAGTTTGATCATGGCTCAGATT..."
)
Code_16S <- structure(list(Old = c("16S317", "16S318", "16S319"), New = c("16S001", 
"16S307", "16S211")), class = "data.frame", row.names = c("1.", 
"2.", "3."))
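
One caveat may be worth checking against the question's concern about codes that occur in both columns. As far as I know, str_replace_all applies the pairs of a named vector one after another, so a code produced by an early pair can be picked up again by a later one. The sketch below reproduces that effect in base R only; the overlapping table and the names `codes`, `lookup`, `headers` and `chained` are hypothetical and not taken from the original data.

```r
# Hypothetical table in which "16S318" is both an old and a new code
codes <- data.frame(Old = c("16S317", "16S318"),
                    New = c("16S318", "16S001"),
                    stringsAsFactors = FALSE)

# Base R counterpart of deframe(codes): names = old codes, values = new codes
lookup <- setNames(codes$New, codes$Old)

headers <- c(">16S317_V._rotiferianus_A", ">16S318_Salmonella_bongori")

# Applying the pairs one after another chains the replacements:
# 16S317 becomes 16S318 in the first pass, then 16S001 in the second
chained <- headers
for (old in names(lookup)) {
  chained <- sub(old, lookup[[old]], chained, fixed = TRUE)
}
chained
```

With this hypothetical table both headers end up carrying 16S001, which is the behaviour described in the question. If the real table has no such overlaps, the one-liner above is unaffected.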

CodePudding user response:

As long as the table is sorted by the Old column, so that its rows are in the same order as the paragraphs in the file, we can substitute paragraph by paragraph. (Initially the table was sorted by the New column.)

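Num <- seq_len(413)       # indices of the 413 paragraphs
new_codes <- codes$New    # codes: the code table, sorted by Old

# Replace the first 7 characters (">" plus the 6-character code) of paragraph x
F_16S <- function(x) {
  sub("^.{7}", paste(">", new_codes[x], sep = ""), FASTA_16S[[1]][x])
}

N_16S <- lapply(Num, F_16S)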

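With gsub("^[>].{7}", ...) I tried to replace only the six characters of the code and leave the leading > in place, but it did not work: the pattern matches the > together with the characters after it, and everything that is matched gets replaced. I therefore match from the start of the string and put the > back with paste.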
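
If the table cannot easily be sorted to match the paragraph order, a lookup by code is one alternative. The following is a minimal sketch, not a tested drop-in: it assumes the paragraphs are held in a plain character vector, that each header starts with ">" followed by the code and an underscore, and it uses a hypothetical helper name `recode_headers` plus made-up overlapping data. Each header is read once and rewritten once, so a code that occurs in both columns is not replaced twice.

```r
# Replace the code in each FASTA header exactly once, via table lookup.
# recode_headers is a hypothetical helper, not part of any package.
recode_headers <- function(x, old, new) {
  hdr <- regexpr("^>[^_]+", x)            # matches ">CODE" at the start
  len <- attr(hdr, "match.length")        # -1 where no header was found
  current <- substring(x, 2, len)         # the code without the leading ">"
  idx <- match(current, old)              # table row for each code
  hit <- !is.na(idx)                      # codes missing from the table stay as-is
  x[hit] <- paste0(">", new[idx[hit]], substring(x[hit], len[hit] + 1))
  x
}

# Hypothetical data in which "16S318" is both an old and a new code
fasta <- c(">16S317_V._rotiferianus_A\n\nAAATTG",
           ">16S318_Salmonella_bongori\n\nAAATTG")
codes <- data.frame(Old = c("16S317", "16S318"),
                    New = c("16S318", "16S001"),
                    stringsAsFactors = FALSE)

recoded <- recode_headers(fasta, codes$Old, codes$New)
recoded
```

Because the new code is looked up from the header's original code, the order of the rows in the table does not matter here.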
