I have two segments of a random fasta file
1 Segment1 AAGGTTCC
2 Segment2 CCTTGGAA
I have another random data set containing dinucleotides' energy values as
AA -1.0
AG -2.0
GG -1.5
GT -1.7
TT -1.2
TC -1.8
CC -1.4
CT -2.5
TG -2.1
GA -2.3
Here, I want to analyze and compare the nucleotides of the two fasta segments against the given energy values using a sliding-window algorithm, such that the energy output for Segment1 is the average over all possible dinucleotides taken in an overlapping sliding window. This gives -10.6, i.e. {(-1.0) + (-2.0) + (-1.5) + (-1.7) + (-1.2) + (-1.8) + (-1.4)}/7. The same computation would be performed for Segment2, preferably using for and if/else loops.
CodePudding user response:
Here's a way to do it in the tidyverse. First, extract every pair of consecutive characters in the string (using f). Then, with some pivoting, merge with the second dataset and compute the sum by group.
library(tidyverse)
f <- function(string) sapply(seq(nchar(string[1]) - 1), \(i) substr(string, i, i + 1))
df1 %>%
mutate(data.frame(f(df1$Segment))) %>%
pivot_longer(-c(ID, Segment), values_to = "Dinu") %>%
inner_join(df2) %>%
group_by(ID, Segment) %>%
summarise(sum = sum(Value))
ID Segment sum
<chr> <chr> <dbl>
1 Segment1 AAGGTTCC -10.6
2 Segment2 CCTTGGAA -12
data
df1 <- read.table(header = F, text = "1 Segment1 AAGGTTCC
2 Segment2 CCTTGGAA")[, 2:3]
colnames(df1) <- c("ID", "Segment")
df2 <- read.table(header = F, text = " AA -1.0
AG -2.0
GG -1.5
GT -1.7
TT -1.2
TC -1.8
CC -1.4
CT -2.5
TG -2.1
GA -2.3")
colnames(df2) <- c("Dinu", "Value")
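Since the question asks for a for and if/else loop, the same sliding-window computation can also be sketched in base R. This is only a minimal sketch: the helper name `window_energy` is hypothetical, the energy table is stored as a named vector rather than as df2, and skipping dinucleotides that are missing from the table is an assumption, not something the question specifies.

```r
# Energy table from the question, stored as a named numeric vector.
energy <- c(AA = -1.0, AG = -2.0, GG = -1.5, GT = -1.7, TT = -1.2,
            TC = -1.8, CC = -1.4, CT = -2.5, TG = -2.1, GA = -2.3)

# Hypothetical helper: slide a window of width 2 over one sequence
# and accumulate the energy of every overlapping dinucleotide.
window_energy <- function(sequence, energy) {
  total <- 0
  n_found <- 0
  n_windows <- max(nchar(sequence) - 1, 0)
  for (i in seq_len(n_windows)) {
    dinu <- substr(sequence, i, i + 1)
    if (dinu %in% names(energy)) {
      total <- total + energy[[dinu]]
      n_found <- n_found + 1
    } else {
      # Assumption: dinucleotides without an energy value are skipped.
      message("No energy value for ", dinu, "; skipping")
    }
  }
  c(sum = total, mean = if (n_found > 0) total / n_found else NA_real_)
}

segments <- c(Segment1 = "AAGGTTCC", Segment2 = "CCTTGGAA")

# One row per segment, with columns "sum" and "mean".
result <- t(sapply(segments, window_energy, energy = energy))
print(round(result, 3))
```

For the two segments in the question, the sum column should match the tidyverse result above (-10.6 and -12), and the mean is that sum divided by the 7 windows.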
CodePudding user response:
Here is a data.table approach (returning the sum and the mean; it was unclear in your post which one you wanted, but for sequences of different lengths the mean might make more sense).
The idea is to split each sequence into a vector of characters, shift that vector by one, then paste the two together and look up the resulting dinucleotides in the named vector of energy values. It would be interesting to compare performance on more and/or longer sequences, but it seems to be faster than the proposed tidyverse approach. I am sure this can still be improved, though.
library(data.table)
dt1 <- data.table(df1, key=c("ID", "Segment"))
dt2 <- with(df2, setNames(Value, Dinu))
dt1[, e:= lapply(.SD, \(x) strsplit(x, "")), by="ID", .SDcols="Segment"]
dt1[, e2 := data.table::shift(e, 1, type="lead")]
dt1 <- dt1[, lapply(.(e, e2), unlist), by = list(Segment, ID)]
dt1[, .(sum = sum(dt2[paste0(V1, V2)], na.rm=TRUE),
mean = mean(dt2[paste0(V1, V2)], na.rm=TRUE)), by=.(Segment, ID)][]
#> Segment ID sum mean
#> 1: AAGGTTCC Segment1 -10.6 -1.514286
#> 2: CCTTGGAA Segment2 -12.0 -1.714286
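The split, shift and look-up idea described above can be illustrated on a single sequence in base R. This is a simplified sketch rather than the data.table code itself; the helper name `dinucleotides` is hypothetical.

```r
# Named vector of energy values, as in dt2 above.
energy <- c(AA = -1.0, AG = -2.0, GG = -1.5, GT = -1.7, TT = -1.2,
            TC = -1.8, CC = -1.4, CT = -2.5, TG = -2.1, GA = -2.3)

# Hypothetical helper: return all overlapping dinucleotides of one sequence.
dinucleotides <- function(sequence) {
  chars <- strsplit(sequence, "")[[1]]
  # Pair positions 1..(n-1) with positions 2..n, i.e. a shift by one.
  paste0(head(chars, -1), tail(chars, -1))
}

dinu <- dinucleotides("AAGGTTCC")
values <- energy[dinu]  # look-up by name; unknown names would give NA

print(dinu)
print(sum(values, na.rm = TRUE))
print(mean(values, na.rm = TRUE))
```

Indexing a named vector by name is what replaces the join used in the tidyverse answer.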
CodePudding user response:
Here is another way using tidytext. We are using the "character shingles" tokenizer, which breaks the sequence up the way you are looking for.
library(tidytext)
library(dplyr)
# you have to use both to_lower = FALSE and lowercase = FALSE unfortunately
df1 %>%
unnest_character_shingles("Dinu", "Segment", n = 2L, to_lower = FALSE,
lowercase = FALSE, drop = FALSE) %>%
left_join(df2, by = "Dinu") %>%
group_by(ID, Segment) %>%
summarize(mean = mean(Value))
Which gives the result:
# A tibble: 2 x 3
# Groups: ID [2]
ID Segment mean
<chr> <chr> <dbl>
1 Segment1 AAGGTTCC -1.51
2 Segment2 CCTTGGAA -1.71