Transforming character strings into sums when every character represents one number

Time:10-05

I have a vector containing DNA sequence strings:

x <- c("ATTAGCCGAGC", "TTCCGGTTAA")

I would like to transform these strings into a sum according to the rule

A <- 2
T <- 2
G <- 4
C <- 4

so that ATTAGCCGAGC is translated to "2 2 2 2 4 4 4 4 2 4 4" and the final output would be "34".

Desired output: a data frame with one column holding the original vector x and another column holding the "sum transformations".

Thanks.

I hope that it's not a problem to use "T".

CodePudding user response:

You can create a named vector with the values, split the strings, match and sum, i.e.

vals <- setNames(c(2, 2, 4, 4), c('A', 'T', 'G', 'C'))
sapply(strsplit(x, ''), \(i)sum(vals[i]))
#[1] 34 28

Put them in a data frame like this:

data.frame(string = x, 
           val = sapply(strsplit(x, ''), \(i)sum(vals[i])))

       string val
1 ATTAGCCGAGC  34
2  TTCCGGTTAA  28
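
To make the intermediate step from the question visible ("2 2 2 2 4 4 ..."), here is a minimal sketch of the same lookup idea. The helper name seq_score is hypothetical and not part of the answer, and vapply is swapped in for sapply only so that the result is guaranteed to be numeric. Because the names A, T, G and C live inside the vector, no variable called T is created, so R's built-in T (shorthand for TRUE) should not be masked.

```r
# Lookup table as in the answer above: each base maps to its value.
vals <- c(A = 2, T = 2, G = 4, C = 4)
x <- c("ATTAGCCGAGC", "TTCCGGTTAA")

# Intermediate step for the first string: one number per character.
chars <- strsplit(x[1], "")[[1]]
per_base <- unname(vals[chars])
per_base
#[1] 2 2 2 2 4 4 4 4 2 4 4
sum(per_base)
#[1] 34

# Hypothetical helper: split every string, look up each character, sum.
# vapply() enforces exactly one numeric result per input string.
seq_score <- function(s, lookup = vals) {
  vapply(strsplit(s, ""), function(ch) sum(lookup[ch]), numeric(1))
}
seq_score(x)
#[1] 34 28
```

A character that is missing from the lookup table (for example "N") would index to NA, so the sum for that string would be NA as well.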

CodePudding user response:

You can try chartr with utf8ToInt, as below:

> sapply(chartr("ATGC", "2244", x), function(v) sum(utf8ToInt(v) - 48))
22224444244  2244442222 
         34          28
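
The one-liner above packs three steps together, so here is a sketch that spells them out. The variable names (digits, zero, scores) are only illustrative. The 48 in the answer is the code point of the character "0", which is why subtracting it turns the code points of "2" and "4" back into the numbers 2 and 4.

```r
x <- c("ATTAGCCGAGC", "TTCCGGTTAA")

# Step 1: translate each base to its digit character.
digits <- chartr("ATGC", "2244", x)
digits
#[1] "22224444244" "2244442222"

# Step 2: utf8ToInt() returns the code points of one string; the code
# point of "0" is 48, so subtracting it recovers the digit values.
zero <- utf8ToInt("0")
utf8ToInt(digits[1]) - zero
#[1] 2 2 2 2 4 4 4 4 2 4 4

# Step 3: sum per string. USE.NAMES = FALSE drops the digit strings
# that sapply() would otherwise attach as names to the result.
scores <- sapply(digits, function(v) sum(utf8ToInt(v) - zero), USE.NAMES = FALSE)
scores
#[1] 34 28
```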

CodePudding user response:

One approach would be to use gsub twice to map the base pair symbols to either 2 or 4. Then, use a custom digit summing function to get the sums:

x <- c("ATTAGCCGAGC", "TTCCGGTTAA")
x <- as.numeric(gsub("[CG]", "4", gsub("[AT]", "2", x)))
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(x) - 1))) %% 10)
sapply(x, function(x) digitsum(x))

[1] 34 28

The digit sum function was taken from this helpful SO question.
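
One thing to keep in mind with this approach: the digit string is converted to a double, which holds only about 15 digits exactly, so sequences longer than that would likely give wrong sums. As a sketch of a different technique that stays in string space (counting characters instead of summing digits), the following may help. The helper name count_score is hypothetical.

```r
x <- c("ATTAGCCGAGC", "TTCCGGTTAA")

# Hypothetical helper: delete everything except A/T (or C/G), count what
# is left with nchar(), then weight the two counts by 2 and 4.
count_score <- function(s) {
  n_at <- nchar(gsub("[^AT]", "", s))
  n_cg <- nchar(gsub("[^CG]", "", s))
  2 * n_at + 4 * n_cg
}

count_score(x)
#[1] 34 28

# A 40-character sequence, too long for an exact numeric round trip.
long <- strrep("ACGT", 10)
count_score(long)
#[1] 120
```

Since gsub and nchar are both vectorised, no sapply call is needed here.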

CodePudding user response:

Using chartr:

chartr("ATGC", "2244", x) |>
  strsplit(split = "") |>
  sapply(function(x) sum(as.numeric(x)))
#[1] 34 28

In a dataframe:

chr2int <- function(x){
  chartr("ATGC", "2244", x) |>
    strsplit(split = "") |>
    sapply(function(str) sum(as.numeric(str)))
}

transform(data.frame(x), 
          s = chr2int(x))

#            x  s
#1 ATTAGCCGAGC 34
#2  TTCCGGTTAA 28