Is there a way on R to represent the alternation of these sequences, SEQ1 and SEQ2, within a single

Time:03-16

I have a data frame in R similar to this one, only it is 2000 rows long. Throughout the data frame, SEQ1 and SEQ2 alternate within a single read called "id read". SEQ1 is always 1 nucleotide away from SEQ2, while SEQ2 is about 335 nucleotides away from SEQ1, although the distance sometimes jumps to about 670. The sequences occur in both forward and reverse orientation, as can be seen from the end coordinate, which is sometimes less than the start coordinate.

sequence id read start end sequencedistance sequencelength
SEQ1 id read 90 105 1 15
SEQ2 id read 440 458 335 18
SEQ1 id read 459 474 1 15
SEQ2 id read 808 826 334 18
SEQ1 id read 827 812 1 15
SEQ2 id read 1148 1156 336 18
SEQ1 id read 1157 1172 1 15
SEQ2 id read 1850 1868 678 18
SEQ1 id read 1869 1854 1 15
SEQ2 id read 2187 2205 333 18
SEQ1 id read 2206 2221 1 15
SEQ2 id read 2887 2905 666 18

Would anyone have any ideas on how to plot this data and visually show the pattern that these sequences have within a read? I have tried plotting with horizontal lines, lollipops, and points, but none of these methods is effective at representing the amount of data I have or at making the behavior of these sequences visually clear. If necessary, I could plot only a part of the large data frame, but I would at least like to understand what is particular about these sequences in the ultra-long read under consideration.

CodePudding user response:

I'm still not exactly sure what you are looking for, but if every row i where sequence == "SEQ1" has a paired row i + 1 where sequence == "SEQ2", you can calculate the relative start and end sites and then try to visualise them.

Assuming your data is in a variable called df, you can calculate these as follows.

df <- transform(
  df,
  rel_start = ifelse(
    as.character(sequence) == "SEQ1",
    start - start,
    start - c(0, head(start, -1))
  ),
  rel_end = ifelse(
    as.character(sequence) == "SEQ1",
    end - start,
    end - c(0, head(start, -1))
  )
)
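The calculation above relies on the pairing assumption (each SEQ1 row immediately followed by a SEQ2 row), so it may be worth checking that first. The following is a minimal sketch on a toy data frame with values copied from the question; the helper names is_seq1, next_seq, paired_ok and all_paired are made up for illustration.

```r
# Toy data: the first six rows of the question's table
df <- data.frame(
  sequence = rep(c("SEQ1", "SEQ2"), times = 3),
  start    = c(90, 440, 459, 808, 827, 1148),
  end      = c(105, 458, 474, 826, 812, 1156)
)

# as.character() guards against the column having been read in as a factor
seq_chr  <- as.character(df$sequence)
is_seq1  <- seq_chr == "SEQ1"

# Label of the following row (NA after the last row)
next_seq <- c(seq_chr[-1], NA)

# For each SEQ1 row: is the next row a SEQ2 row?
paired_ok <- next_seq[is_seq1] == "SEQ2"

# A trailing SEQ1 with no partner yields NA and is counted as unpaired
all_paired <- all(paired_ok %in% TRUE)
all_paired
```

If all_paired is FALSE on the full data, which(is_seq1)[!(paired_ok %in% TRUE)] gives the row numbers of the SEQ1 rows that break the pattern.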

Then for visualisation, you can just use geom_segment(). You could use arrows to indicate the direction of the reads.

library(ggplot2)

ggplot(df, aes(rel_start, y = seq_along(start), colour = sequence)) +
  geom_segment(aes(xend = rel_end, yend = seq_along(start)),
               arrow = arrow(length = unit(2, "mm")))
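The question also mentions two features that could be made explicit as columns before plotting: the orientation (end less than start) and the spacing that is sometimes about 335 and sometimes about 670 nucleotides. The sketch below derives both on a toy subset of the question's table. The column names strand and n_units, and the unit length of 335, are assumptions for illustration and not part of the original data.

```r
# Toy subset of the question's table (values copied from it)
df <- data.frame(
  sequence         = c("SEQ1", "SEQ2", "SEQ1", "SEQ2"),
  start            = c(827, 1148, 1157, 1850),
  end              = c(812, 1156, 1172, 1868),
  sequencedistance = c(1, 336, 1, 678)
)

# Orientation: the question treats end < start as a reverse hit
df$strand <- ifelse(df$end < df$start, "reverse", "forward")

# Assumption: SEQ2 spacing is roughly a whole multiple of 335 nt,
# so rounding the ratio gives the number of repeat units (1, 2, ...)
unit_nt <- 335
df$n_units <- ifelse(
  as.character(df$sequence) == "SEQ2",
  round(df$sequencedistance / unit_nt),
  NA
)

df
```

Columns like these could then be mapped to an aesthetic in the geom_segment() call, for example linetype = strand, or used to split the plot into panels by n_units.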

Data loading:

txt <- "sequence    id read     start   end     sequencedistance    sequencelength
SEQ1    id read     90  105     1   15
SEQ2    id read     440     458     335     18
SEQ1    id read     459     474     1   15
SEQ2    id read     808     826     334     18
SEQ1    id read     827     812     1   15
SEQ2    id read     1148    1156    336     18
SEQ1    id read     1157    1172    1   15
SEQ2    id read     1850    1868    678     18
SEQ1    id read     1869    1854    1   15
SEQ2    id read     2187    2205    333     18
SEQ1    id read     2206    2221    1   15
SEQ2    id read     2887    2905    666     18"

df <- read.table(text = txt, header = TRUE)