I have a vector containing DNA sequence strings:
x <- c("ATTAGCCGAGC", "TTCCGGTTAA")
I would like to transform these strings into a sum according to the rule
A <- 2
T <- 2
G <- 4
C <- 4
so that ATTAGCCGAGC is translated to "2 2 2 2 4 4 4 4 2 4 4" and the final output would be "34".
Desired output: a data frame consisting of a column with the original vector x and another column with the "sum transformations".
Thanks.
I hope that it's not a problem to use "T".
CodePudding user response:
You can create a named vector with the values, split the strings, match and sum, i.e.
vals <- setNames(c(2, 2, 4, 4), c('A', 'T', 'G', 'C'))
sapply(strsplit(x, ''), \(i)sum(vals[i]))
#[1] 34 28
Put them in a data frame like this:
data.frame(string = x,
           val = sapply(strsplit(x, ''), \(i) sum(vals[i])))
string val
1 ATTAGCCGAGC 34
2 TTCCGGTTAA 28
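To make the lookup step easier to follow, here is a minimal sketch of what happens for the first string only. The intermediate names `chars`, `per_base` and `total` are just illustrative and not part of the answer above.

```r
x <- c("ATTAGCCGAGC", "TTCCGGTTAA")
vals <- setNames(c(2, 2, 4, 4), c("A", "T", "G", "C"))

# Split the first string into single characters
chars <- strsplit(x[1], "")[[1]]

# Indexing the named vector by character returns one value per base;
# unname() drops the A/T/G/C labels from the result
per_base <- unname(vals[chars])
print(per_base)
# [1] 2 2 2 2 4 4 4 4 2 4 4

# Summing the per-base values gives the result for this string
total <- sum(per_base)
print(total)
# [1] 34
```

The `sapply` call in the answer simply repeats these steps for every element of `x`.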
CodePudding user response:
I guess you can try chartr and utf8ToInt like below
> sapply(chartr("ATGC", "2244", x), function(v) sum(utf8ToInt(v) - 48))
22224444244 2244442222
34 28
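In case the `- 48` looks arbitrary: `utf8ToInt` returns character code points, and the code point of "0" is 48, so subtracting 48 recovers the digit values. A small sketch of the intermediate steps follows; the variable names and the use of `vapply` with `USE.NAMES = FALSE` (to drop the names seen in the output above) are my own choices, not part of the answer.

```r
x <- c("ATTAGCCGAGC", "TTCCGGTTAA")

# chartr() swaps each base for a digit character, one-for-one
digits <- chartr("ATGC", "2244", x)
print(digits)
# [1] "22224444244" "2244442222"

# utf8ToInt() gives code points: "2" is 50 and "4" is 52
codes <- utf8ToInt(digits[1])

# Subtracting 48 (the code point of "0") yields the digit values,
# which are then summed per string
sums <- vapply(digits, function(v) sum(utf8ToInt(v) - 48L),
               integer(1), USE.NAMES = FALSE)
print(sums)
# [1] 34 28
```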
CodePudding user response:
One approach would be to use gsub twice to map the base pair symbols to either 2 or 4. Then, use a custom digit summing function to get the sums:
x <- c("ATTAGCCGAGC", "TTCCGGTTAA")
x <- as.numeric(gsub("[CG]", "4", gsub("[AT]", "2", x)))
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(x) - 1))) %% 10)
sapply(x, digitsum)
[1] 34 28
The digit sum function was taken from this helpful SO question.
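One thing to keep in mind: `as.numeric` stores the digit string as a double, which represents integers exactly only up to 2^53 (about 9e15), so the numeric route should only be relied on for fairly short sequences. A possible variant, sketched here under that assumption, keeps the digits as characters instead. `digitsum_chr` is a hypothetical helper name, and the third test sequence is made up for illustration.

```r
# Hypothetical helper: digit sum computed on the string itself,
# so no large number ever has to be represented
digitsum_chr <- function(s) sum(as.integer(strsplit(s, "")[[1]]))

# The third element is an invented 40-base sequence
x <- c("ATTAGCCGAGC", "TTCCGGTTAA", strrep("ATGC", 10))

# Same gsub mapping as above, but without as.numeric()
digits <- gsub("[CG]", "4", gsub("[AT]", "2", x))

sums <- vapply(digits, digitsum_chr, integer(1), USE.NAMES = FALSE)
print(sums)
# [1]  34  28 120
```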
CodePudding user response:
Using chartr:
chartr("ATGC", "2244", x) |>
  strsplit(split = "") |>
  sapply(function(x) sum(as.numeric(x)))
#[1] 34 28
In a dataframe:
chr2int <- function(x) {
  chartr("ATGC", "2244", x) |>
    strsplit(split = "") |>
    sapply(function(str) sum(as.numeric(str)))
}
transform(data.frame(x),
          s = chr2int(x))
# x s
#1 ATTAGCCGAGC 34
#2 TTCCGGTTAA 28