Stuck on an R problem.
I have a random fasta file:
> header
AGTCAGTCAGTC
My desired output is:
segment1 AGTC
segment2 GTCA
segment3 TCAG
segment4 CAGT
segment5 AGTC
segment6 GTCA
segment7 TCAG
segment8 CAGT
segment9 AGTC
segment10 GTC
segment11 TC
segment12 C
Any help will be greatly appreciated!
CodePudding user response:
library(tidyverse)
seq <- "AGTCAGTCAGTC"
seq %>%
  nchar() %>%
  seq() %>%
  tibble(name = .) %>%
  mutate(
    # window of up to 4 characters starting at each position
    sq = str_sub(seq, name, name + 3),
    fasta = str_glue(">{name}\n{sq}")
  ) %>%
  pull(fasta) %>%
  write_lines("out.fasta")
This results in the file out.fasta containing:
>1
AGTC
>2
GTCA
>3
TCAG
>4
CAGT
>5
AGTC
>6
GTCA
>7
TCAG
>8
CAGT
>9
AGTC
>10
GTC
>11
TC
>12
C
CodePudding user response:
You can use substr to extract a substring from a string.
First read in the FASTA file, with skip = 1 to ignore the header. Then use lapply to go through every character position in myseq, extracting the string from position x to x + 3. Finally, use setNames to fit your desired output.
However, the structure of your desired output is not clear, and my code will generate a named list.
myseq <- scan("testing.fasta", character(), skip = 1)
myseq_4mer <- lapply(1:nchar(myseq), function(x) substr(myseq, x, x + 3))
setNames(myseq_4mer, paste(">segment", 1:length(myseq_4mer)))
$`>segment 1`
[1] "AGTC"
$`>segment 2`
[1] "GTCA"
$`>segment 3`
[1] "TCAG"
$`>segment 4`
[1] "CAGT"
$`>segment 5`
[1] "AGTC"
$`>segment 6`
[1] "GTCA"
$`>segment 7`
[1] "TCAG"
$`>segment 8`
[1] "CAGT"
$`>segment 9`
[1] "AGTC"
$`>segment 10`
[1] "GTC"
$`>segment 11`
[1] "TC"
$`>segment 12`
[1] "C"
Or save it to a FASTA file:
library(seqinr)
write.fasta(myseq_4mer, paste("segment", 1:length(myseq_4mer)), "out.fasta")
>segment 1
AGTC
>segment 2
GTCA
>segment 3
TCAG
>segment 4
CAGT
>segment 5
AGTC
>segment 6
GTCA
>segment 7
TCAG
>segment 8
CAGT
>segment 9
AGTC
>segment 10
GTC
>segment 11
TC
>segment 12
C
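If what you actually want is the plain "segmentN SEQUENCE" lines from the question, a vectorized variant of the same idea may be enough. This is only a sketch: it assumes the sequence is a single string (here hard-coded instead of read with scan), and the names k, starts, kmers and out are made up for illustration.

```r
myseq <- "AGTCAGTCAGTC"          # assumption: in practice, read from the FASTA file as above
k <- 4                           # window length, taken from the desired output

starts <- seq_len(nchar(myseq))  # one window per start position
# substring() is vectorized over start/stop; stops past the end are truncated
kmers <- substring(myseq, starts, starts + k - 1)

out <- paste0("segment", starts, " ", kmers)
writeLines(out)                  # prints one "segmentN SEQUENCE" line per window
```

Because substring truncates at the end of the string, the last three windows come out shorter than k, which matches the output shown in the question.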
CodePudding user response:
You can split the string into characters and apply a rolling function with rollapply() from zoo.
x <- "AGTCAGTCAGTC"
zoo::rollapply(strsplit(x, "")[[1]], 4, paste, collapse = "",
partial = TRUE, align = "left")
# [1] "AGTC" "GTCA" "TCAG" "CAGT" "AGTC" "GTCA" "TCAG" "CAGT" "AGTC" "GTC" "TC" "C"
partial = TRUE: The subset of indexes that are in range are passed to FUN.
align = "left": The index of the result should be left-aligned compared to the rolling window of observations.
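To see what these two arguments change, here is a rough base R imitation of left-aligned windows. It is not how zoo implements rollapply; left_windows is a hypothetical helper written only to show the effect of allowing partial windows at the end.

```r
# Hypothetical helper (not part of zoo): left-aligned windows over a character vector.
left_windows <- function(chars, width, partial = TRUE) {
  n <- length(chars)
  starts <- seq_len(n)
  ends <- starts + width - 1
  if (partial) {
    ends <- pmin(ends, n)        # clip windows that run past the end
  } else {
    keep <- ends <= n            # keep only full-width windows
    starts <- starts[keep]
    ends <- ends[keep]
  }
  mapply(function(s, e) paste(chars[s:e], collapse = ""), starts, ends)
}

chars <- strsplit("AGTCAGTCAGTC", "")[[1]]
full_only <- left_windows(chars, 4, partial = FALSE)
with_partial <- left_windows(chars, 4, partial = TRUE)
```

With partial = FALSE only the 9 full 4-mers are kept; with partial = TRUE the three shorter windows at the end ("GTC", "TC", "C") are included as well.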