Split a FASTA file into segments of a desired nucleotide length using a sliding window

I'm stuck on an R problem.

I have a FASTA file:

> header 
AGTCAGTCAGTC

My desired output is:

segment1   AGTC
segment2   GTCA
segment3   TCAG
segment4   CAGT
segment5   AGTC
segment6   GTCA
segment7   TCAG
segment8   CAGT
segment9   AGTC
segment10  GTC
segment11  TC
segment12  C

Any help will be greatly appreciated!

CodePudding user response：

library(tidyverse)

seq <- "AGTCAGTCAGTC"

seq %>%
  nchar() %>%
  seq() %>%
  tibble(name = .) %>%
  mutate(
    sq = str_sub(seq, name, name + 3),
    fasta = str_glue(">{name}\n{sq}")
  ) %>%
  pull(fasta) %>%
  write_lines("out.fasta")

This writes the file out.fasta, which contains:

>1
AGTC
>2
GTCA
>3
TCAG
>4
CAGT
>5
AGTC
>6
GTCA
>7
TCAG
>8
CAGT
>9
AGTC
>10
GTC
>11
TC
>12
C
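If you would rather not load the tidyverse, or want the window width as a parameter instead of the hard-coded 3, here is a minimal base R sketch of the same idea. The helper name sliding_segments and its width argument are made up for illustration; it relies on substring() being vectorised over start and end positions and truncating at the end of the string.

```r
# Hypothetical helper: all left-aligned windows of `width` characters,
# including the shorter windows at the end of the sequence.
sliding_segments <- function(s, width = 4) {
  starts <- seq_len(nchar(s))
  data.frame(
    name = paste0("segment", starts),
    # substring() stops at the end of the string, so the last windows are shorter
    sq = substring(s, starts, starts + width - 1),
    stringsAsFactors = FALSE
  )
}

segments <- sliding_segments("AGTCAGTCAGTC", width = 4)
print(segments)
```

This returns a two-column data frame (segment name, sequence), which is close to the layout shown in the question.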

CodePudding user response：

You can use substr() to extract a substring from a string.

First read in the FASTA file with scan(), using skip = 1 to ignore the header line. Then use lapply() to go through every position in myseq, extracting the substring from position x to x + 3. Finally, use setNames() to name the elements to match your desired output.

However, the structure of your desired output is not clear, so my code generates a named list.

myseq <- scan("testing.fasta", character(), skip = 1)

myseq_4mer <- lapply(1:nchar(myseq), function(x) substr(myseq, x, x + 3))

setNames(myseq_4mer, paste(">segment", 1:length(myseq_4mer)))

$`>segment 1`
[1] "AGTC"

$`>segment 2`
[1] "GTCA"

$`>segment 3`
[1] "TCAG"

$`>segment 4`
[1] "CAGT"

$`>segment 5`
[1] "AGTC"

$`>segment 6`
[1] "GTCA"

$`>segment 7`
[1] "TCAG"

$`>segment 8`
[1] "CAGT"

$`>segment 9`
[1] "AGTC"

$`>segment 10`
[1] "GTC"

$`>segment 11`
[1] "TC"

$`>segment 12`
[1] "C"

Or save it to a FASTA file:

library(seqinr)

write.fasta(myseq_4mer, paste("segment", 1:length(myseq_4mer)), "out.fasta")

>segment 1
AGTC
>segment 2
GTCA
>segment 3
TCAG
>segment 4
CAGT
>segment 5
AGTC
>segment 6
GTCA
>segment 7
TCAG
>segment 8
CAGT
>segment 9
AGTC
>segment 10
GTC
>segment 11
TC
>segment 12
C
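One caveat: scan(..., skip = 1) as used above works when the whole sequence sits on a single line. If the sequence is wrapped over several lines, scan() returns one element per line, and nchar(myseq) is then no longer a single number. A rough sketch for that case, assuming the file holds a single record (read_single_fasta is a hypothetical helper, not part of any package):

```r
# Hypothetical helper: return the sequence of a single-record FASTA file
# as one string, even if the sequence is wrapped over several lines.
read_single_fasta <- function(path) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines)]                 # drop blank lines
  seq_lines <- lines[!startsWith(lines, ">")]   # drop the header line
  paste(seq_lines, collapse = "")
}

# Small demonstration with a temporary file whose sequence spans two lines
tmp <- tempfile(fileext = ".fasta")
writeLines(c("> header", "AGTCAG", "TCAGTC"), tmp)
myseq <- read_single_fasta(tmp)
myseq
```

The resulting myseq can be passed to the lapply() call above unchanged.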

CodePudding user response：

You can split the string into characters and apply a rolling function with rollapply() from zoo.

x <- "AGTCAGTCAGTC"

zoo::rollapply(strsplit(x, "")[[1]], 4, paste, collapse = "",
               partial = TRUE, align = "left")

# [1] "AGTC" "GTCA" "TCAG" "CAGT" "AGTC" "GTCA" "TCAG" "CAGT" "AGTC" "GTC"  "TC"   "C"

partial = TRUE: The subset of indexes that are in range are passed to FUN.
align = "left": The index of the result should be left-aligned compared to the rolling window of observations.
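To see what partial = TRUE changes, here is a small base R sketch that imitates left-aligned windows with and without the partial ones at the end. It does not call zoo; left_windows is a hypothetical helper, and the partial = FALSE branch reflects my understanding of rollapply()'s default, which returns only full-width windows.

```r
# Hypothetical helper imitating left-aligned rolling windows over a string.
left_windows <- function(x, width, partial = TRUE) {
  n <- nchar(x)
  # With partial windows every position is a start; without them only
  # positions that still have `width` characters to their right are used.
  starts <- if (partial) seq_len(n) else seq_len(max(n - width + 1, 0))
  if (length(starts) == 0) return(character(0))
  substring(x, starts, starts + width - 1)
}

x <- "AGTCAGTCAGTC"
with_partial <- left_windows(x, 4, partial = TRUE)   # includes "GTC", "TC", "C"
full_only    <- left_windows(x, 4, partial = FALSE)  # only windows of length 4

length(with_partial)
length(full_only)
```

So if you only want complete 4-mers, dropping partial = TRUE (or filtering on nchar() afterwards) should be enough.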