is there any way to find the specific DNA sequence motif in FASTA data format file in R?-CodePudding

I have a question and it would be great if anybody who knows help me in it. I have a big FASTA format file that contains FASTA sequence of the chip-seq peaks. I did motif analysis for my data and now I have many known motifs like FOXA1 or SP1 motif. But actually I want to know, in which position (peak) I have sequence of that specific motif. So, I have the sequence of the motif and I want to search this motif in my FASTA file to find where there is this sequence . I hope i delivered the concept to you. This is completely new for me and I have'nt had any experiance in this issue.

df1<-data.frame(position=c( ">69366501",">69366750",">69368467",">69766890"),SEQ=c("GAAGGAGAAGGGGAGAGACTGGAAGAGAAGGAAGGAGCTTTGGG","AAGAGGGGAAAAGACCATAGCAGAGAGCTCAGCCTGACCACGG","AAGAGGGGAATTTGATAGCAGTGCAGCTAGCTAGCGGGCACCACGG","GAAGGAGAAGGGGAGAGACTGGTATAGATGACCATAGGAAGGAGCTTTGGG"))

and for example the sequence that I want to search it and find it in my file is "AGTGCAGCTAGCTAGCGGGC" so, I want to see that I have this motif sequence in my file or not and if yes, where is it, in whcih position.

it is not exact file that I have but I think it is good to deliver my meaning. My main and original file is in FASTA format.

Thanks in advance.

CodePudding user response：

In addition to Chris's point, if you are trying to find the location of the string within the SEQ column, this approach should work:

if( require(tidyverse) ==F ) install.packages('tidyverse'); library(tidyverse)

df1 <-data.frame(
  position=c( ">69366501",">69366750",">69368467",">69766890"),
  SEQ=c("GAAGGAGAAGGGGAGAGACTGGAAGAGAAGGAAGGAGCTTTGGG",
        "AAGAGGGGAAAAGACCATAGCAGAGAGCTCAGCCTGACCACGG",
        "AAGAGGGGAATTTGATAGCAGTGCAGCTAGCTAGCGGGCACCACGG",
        "GAAGGAGAAGGGGAGAGACTGGTATAGATGACCATAGGAAGGAGCTTTGGG")
)

df1$SEQ %>% stringr::str_locate(pattern = "AGTGCAGCTAGCTAGCGGGC")

And here is the solution output, which shows that this pattern only existed in the third column and it started at character 20 and ended at character 39, which makes sense since nchar("AGTGCAGCTAGCTAGCGGGC") would return 20 (i.e., 39-20 1).

CodePudding user response：

And noodling around in base, though @Fadel-Megahed's certainly seems easier... anticipating you might want position start of the embedded substring:

df2 <-data.frame(
  position=c( ">69366501",">69366750",">69368467",">69766890"),
  SEQ=c("GAAGGAGAAGGGGAGAGACTGGAAGAGAAGGAAGGAGCTTTGGG",
        "AAGAGGGGAAAAGACCATAGCAGAGAGCTCAGCCTGACCACGG",
        "AAGAGGGGAATTTGATAGCAGTGCAGCTAGCTAGCGGGCACCACGG",
        "GAAGGAGAAGGGGAGAGACTGGTATAGATGACCATAGGAAGGAGCTTTGGG")
)

d2$position <- gsub('>', '', df2$position)
# what we're looking for in SEQ
term <- 'AGTGCAGCTAGCTAGCGGGC'

strsplit(df2$SEQ, term, fixed = TRUE)
[[1]]
[1] "GAAGGAGAAGGGGAGAGACTGGAAGAGAAGGAAGGAGCTTTGGG"

[[2]]
[1] "AAGAGGGGAAAAGACCATAGCAGAGAGCTCAGCCTGACCACGG"

[[3]]
[1] "AAGAGGGGAATTTGATAGC" "ACCACGG"            

[[4]]
[1] "GAAGGAGAAGGGGAGAGACTGGTATAGATGACCATAGGAAGGAGCTTTGGG"

# only one string is split by `term`, SEQ[3] which is the same as our `grep(`
grep(term,df2$SEQ) 
[1] 3

nchar(unlist(strsplit(df2$SEQ[3], term, fixed = TRUE))[1])
[1] 19
nchar(term)
[1] 20
nchar(unlist(strsplit(df2$SEQ[3], term, fixed = TRUE))[2])
[1] 7
> nchar(df2$SEQ[3])
[1] 46
(nchar(unlist(strsplit(df2$SEQ[3], term, fixed = TRUE))[1])   nchar(term)   nchar(unlist(strsplit(df2$SEQ[3], term, fixed = TRUE))[2]) == nchar(df2$SEQ[3]))
[1] TRUE
as.numeric(df2$position[3])   as.numeric(nchar(unlist(strsplit(df2$SEQ[3], term, fixed = TRUE))[1]))
[1] 69368486 # which may well want  1 depending...

It is hard to imagine that Bioconductor doesn't have convenience utilities that anticipate these needs in the realm of FASTA data, and would be worth investigating.