I am new to coding and have been struggling with this for a month, so any help is appreciated. Here is a sample of my FASTA sequences (the full data set has almost 100M reads):
Line_1 CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATCAACCACATGTGTGCATGACTAGCCGATCGCAGCGGCCGCATACGATTGCT
Line_2 CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATCAACCACATGTGTGCATGACTATGCCGTACCAGCGGCCGCATACGATTGCT
Line_3 CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATGTACCACATGTGTGACGTACTTTACGTACGCAGCGGCCGCATACGATTGCT
I want to find the pattern "GGCAAGCAAAAGACGGCATACGAGATAT" in each read. The 18 nt immediately after the pattern (just before ACT) should then be captured and written to an Excel sheet, with the 18 nt barcodes in one column and their counts in the next column. For example:
Barcodes | Count_in_reads |
---|---|
CAACCACATGTGTGCATG | 2 |
GTACCACATGTGTGACGT | 1 |
Something like this.
In the second part, I want to use this 18 nt column to search the sequences again, find the 9 bp that come after ACT (ACT directly follows the 18 nt barcode), and count them within each 18 nt barcode type:
Barcodes | Random_barcode | Frequency_random_barcode |
---|---|---|
CAACCACATGTGTGCATG | AGCCGATCG | 1 |
CAACCACATGTGTGCATG | ATGCCGTAC | 1 |
GTACCACATGTGTGACGT | TTACGTACG | 1 |
So one 18 nt barcode can have several random 9 bp sequences. In the table above, this shows up as two rows with the same 18 nt barcode but different random barcodes, each with a frequency of 1.
Please let me know if I need to explain the question better.
Thanks for the help!!!
CodePudding user response:
Suppose your data is stored in a data.frame named df and the sequences are in a column named sequence.
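To try the code below on something concrete, the three sample reads from the question can be put into such a data.frame. This is only a small sketch for testing; how the real reads get loaded from the FASTA file is not covered here.

```r
# Sample reads copied from the question (Line_1 to Line_3).
# The names df and sequence follow the assumption stated above.
df <- data.frame(
  sequence = c(
    "CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATCAACCACATGTGTGCATGACTAGCCGATCGCAGCGGCCGCATACGATTGCT",
    "CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATCAACCACATGTGTGCATGACTATGCCGTACCAGCGGCCGCATACGATTGCT",
    "CGAATTCGCCTTTGAGATTGAGTGTGAAGTTAATATTCATAGCTTCACGCTCGATCTCAAAGGCTTTTTTGGCAAGCAAAAGACGGCATACGAGATATGTACCACATGTGTGACGTACTTTACGTACGCAGCGGCCGCATACGATTGCT"
  )
)
```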
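library(dplyr)
library(stringr)
df %>%
  # Lookbehind: take the 18 characters right after the flanking pattern.
  # Reads without the pattern get NA as their barcode.
  mutate(Barcodes = str_extract(sequence, "(?<=GGCAAGCAAAAGACGGCATACGAGATAT).{18}")) %>%
  # Number of reads per barcode
  count(Barcodes)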
returns
# A tibble: 2 x 2
  Barcodes               n
  <chr>              <int>
1 CAACCACATGTGTGCATG     2
2 GTACCACATGTGTGACGT     1
and
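df %>%
  mutate(
    # 18 nt barcode right after the flanking pattern
    Barcodes = str_extract(sequence, "(?<=GGCAAGCAAAAGACGGCATACGAGATAT).{18}"),
    # 9 nt random barcode: the lookbehind skips the pattern, the 18 nt barcode and ACT
    Random_barcode = str_extract(sequence, "(?<=GGCAAGCAAAAGACGGCATACGAGATAT.{18}ACT).{9}")
  ) %>%
  # Number of reads per (barcode, random barcode) pair
  count(Barcodes, Random_barcode)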
returns
# A tibble: 3 x 3
  Barcodes           Random_barcode     n
  <chr>              <chr>          <int>
1 CAACCACATGTGTGCATG AGCCGATCG          1
2 CAACCACATGTGTGCATG ATGCCGTAC          1
3 GTACCACATGTGTGACGT TTACGTACG          1
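The question also asks for the result in an Excel sheet, which the code above does not cover. Here is a minimal base-R sketch of one way to do it, assuming a CSV file is acceptable (Excel opens it directly). It uses one regex with two capture groups instead of the lookbehinds above, so both barcodes come out of a single match. The toy reads, the `[ACGT]` character class and the output path are my assumptions for illustration; only the flanking pattern, ACT and the barcodes come from the question.

```r
# Flanking pattern and barcodes are taken from the question; the toy reads
# are assembled from them purely for illustration (prefix/suffix are made up).
flank <- "GGCAAGCAAAAGACGGCATACGAGATAT"
reads <- paste0(
  "TTGAG", flank,
  c("CAACCACATGTGTGCATG", "CAACCACATGTGTGCATG", "GTACCACATGTGTGACGT"),
  "ACT",
  c("AGCCGATCG", "ATGCCGTAC", "TTACGTACG"),
  "CAGCGG"
)

# One regex, two capture groups: 18 nt barcode, literal ACT, 9 nt random barcode.
pattern <- paste0(flank, "([ACGT]{18})ACT([ACGT]{9})")
hits <- regmatches(reads, regexec(pattern, reads))
hits <- hits[lengths(hits) == 3]  # drop reads without a full match

# Each hit is c(full match, group 1, group 2); every read contributes a count of 1.
pairs <- data.frame(
  Barcodes = vapply(hits, `[`, character(1), 2),
  Random_barcode = vapply(hits, `[`, character(1), 3),
  Frequency_random_barcode = 1L
)

# Sum the per-read 1s within each (barcode, random barcode) pair.
pair_counts <- aggregate(
  Frequency_random_barcode ~ Barcodes + Random_barcode,
  data = pairs, FUN = sum
)

# Hypothetical output location; replace with the path you actually want.
out_file <- file.path(tempdir(), "barcode_pairs.csv")
write.csv(pair_counts, out_file, row.names = FALSE)
```

The column names match the second table in the question. Unlike the lookbehind version, reads that lack the pattern or the ACT are dropped here rather than counted under NA.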