I want to search for a pattern and then grab 18 characters, further using these 18 characters as base, w

Time:10-19

I am new to coding and have been struggling with this for a month. Please help me with it. Here is a sample of my FASTA sequences (almost 100M reads in total):

Line_1 CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATCAACCACATGTGTGCATGACTAGCCGATCGCAGCGGCCGCATACGATTGCT

Line_2 CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATCAACCACATGTGTGCATGACTATGCCGTACCAGCGGCCGCATACGATTGCT

Line_3 CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATGTACCACATGTGTGACGTACTTTACGTACGCAGCGGCCGCATACGATTGCT

I want to find the pattern "GGCAAGCAAAAGACGGCATACGAGATAT". The 18 nt immediately after the pattern (just before ACT) should then be captured and written to an Excel sheet, with a column "18nt bases" and the read count in the next column. For example:

Barcodes Count_in_reads
CAACCACATGTGTGCATG 2
GTACCACATGTGTGACGT 1

Something like this.

In the second part, I want to use this 18 nt column to search the sequences again, find the 9 bp that follow ACT (which comes right after the 18 nt), and count them within each 18 nt barcode type.

Barcodes Random_barcode Frequency_random_barcode
CAACCACATGTGTGCATG AGCCGATCG 1
CAACCACATGTGTGCATG ATGCCGTAC 1
GTACCACATGTGTGACGT TTACGTACG 1

So one 18 nt barcode can have several random 9 bp sequences. Above, this shows as two rows with the same 18 nt barcode but different random barcodes, each with a frequency of 1.

Please let me know if the question is unclear.

Thanks for the help!!!

CodePudding user response:

Suppose your data is stored in a data.frame named df and the sequences are in a column named sequence.
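
If the reads are still in a text file, a data.frame of that shape could be built roughly as follows. This is only a sketch: it assumes one record per line in the form "Line_1 SEQUENCE", as in the question, and uses a temporary file in place of the real reads file. The column name id is made up for illustration; a real multi-line FASTA file would need a proper parser instead.

```r
# Hypothetical input: one record per line, "<id> <sequence>", as in the question.
# A temporary file stands in for the real reads file here.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Line_1 CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATCAACCACATGTGTGCATGACTAGCCGATCGCAGCGGCCGCATACGATTGCT",
  "",
  "Line_2 CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATCAACCACATGTGTGCATGACTATGCCGTACCAGCGGCCGCATACGATTGCT",
  "",
  "Line_3 CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATGTACCACATGTGTGACGTACTTTACGTACGCAGCGGCCGCATACGATTGCT"
), path)

# read.table() splits each line on whitespace and skips blank lines by default;
# colClasses keeps both columns as plain character vectors.
df <- read.table(path, header = FALSE,
                 col.names = c("id", "sequence"),
                 colClasses = "character")
```

With close to 100M reads, loading everything at once may not fit in memory, so reading and counting in chunks might be necessary.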

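library(dplyr)
library(stringr)

# Look behind for the anchor pattern and extract the 18 nt that follow it,
# then count how many reads carry each 18 nt barcode.
df %>% 
  mutate(Barcodes = str_extract(sequence, "(?<=GGCAAGCAAAAGACGGCATACGAGATAT).{18}")) %>% 
  count(Barcodes)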

returns

# A tibble: 2 x 2
  Barcodes               n
  <chr>              <int>
1 CAACCACATGTGTGCATG     2
2 GTACCACATGTGTGACGT     1

and

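# Extract the 18 nt barcode as before, plus the 9 nt that follow the ACT spacer,
# then count reads per (Barcodes, Random_barcode) combination.
df %>% 
  mutate(
    Barcodes = str_extract(sequence, "(?<=GGCAAGCAAAAGACGGCATACGAGATAT).{18}"),
    Random_barcode = str_extract(sequence, "(?<=GGCAAGCAAAAGACGGCATACGAGATAT.{18}ACT).{9}")
    ) %>% 
  count(Barcodes, Random_barcode)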

returns

# A tibble: 3 x 3
  Barcodes           Random_barcode     n
  <chr>              <chr>          <int>
1 CAACCACATGTGTGCATG AGCCGATCG          1
2 CAACCACATGTGTGCATG ATGCCGTAC          1
3 GTACCACATGTGTGACGT TTACGTACG          1
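
The question also asks for the result in an Excel sheet. One possible route, sketched here in base R only, is to capture both barcodes with a single regular expression and write the two count tables as CSV files, which Excel can open. This is an alternative to the stringr code above, not the same method: it only keeps reads in which the ACT spacer follows the 18 nt, so its counts can differ for reads that lack ACT. The reads are shortened versions of the three sample reads, and the output file names are placeholders.

```r
# Shortened versions of the three sample reads from the question.
reads <- c(
  "TTTTGGCAAGCAAAAGACGGCATACGAGATATCAACCACATGTGTGCATGACTAGCCGATCGCAGCGG",
  "TTTTGGCAAGCAAAAGACGGCATACGAGATATCAACCACATGTGTGCATGACTATGCCGTACCAGCGG",
  "TTTTGGCAAGCAAAAGACGGCATACGAGATATGTACCACATGTGTGACGTACTTTACGTACGCAGCGG"
)

# Anchor pattern, then 18 nt (group 1), the literal ACT spacer, then 9 nt (group 2).
pattern <- "GGCAAGCAAAAGACGGCATACGAGATAT(.{18})ACT(.{9})"

# regexec() + regmatches() give the full match and both capture groups per read.
hits <- regmatches(reads, regexec(pattern, reads))
hits <- hits[lengths(hits) == 3]  # drop reads without a complete match

pairs <- data.frame(
  Barcodes = vapply(hits, `[`, character(1), 2),
  Random_barcode = vapply(hits, `[`, character(1), 3)
)

# Part 1: number of reads per 18 nt barcode.
barcode_counts <- as.data.frame(table(Barcodes = pairs$Barcodes),
                                responseName = "Count_in_reads",
                                stringsAsFactors = FALSE)

# Part 2: number of reads per (18 nt barcode, 9 nt random barcode) pair.
pair_counts <- aggregate(Frequency_random_barcode ~ Barcodes + Random_barcode,
                         data = transform(pairs, Frequency_random_barcode = 1L),
                         FUN = sum)

# CSV files open directly in Excel; these paths are placeholders.
write.csv(barcode_counts, file.path(tempdir(), "barcode_counts.csv"), row.names = FALSE)
write.csv(pair_counts, file.path(tempdir(), "pair_counts.csv"), row.names = FALSE)
```

The column names follow the tables in the question. Replacing tempdir() with a real output folder would put the files where Excel can find them.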