How to generate one hot encoding for DNA sequences using R or python-CodePudding

I want to generate one hot coding matrix for a list of DNA sequences. I have tried to solve my problem from the following link How to generate one hot encoding for DNA sequences? but some of the solutions are given only for one single DNA sequence and not for a list of DNA sequences.

For example

def one_hot_encode(seq):
    mapping = dict(zip("ACGT", range(4)))    
    seq2 = [mapping[i] for i in seq]
    return np.eye(4)[seq2]

one_hot_encode("AACGT")

In the given above code, if I run one_hot_encode("AACGT","GGTAC","CGTAC") it will fail, also i want to generate matrix as output.

Currently, I am working in R and below is my DNA sequence in the r data frame(single-column file)

ACTTTA
TTGATG
CTTACG
GTACGT

Expected output

1   0   0   0   0   1   0   0   0   0   0   1   0   0   0   1   0   0   0   1   1   0   0   0
0   0   0   1   0   0   0   1   0   0   1   0   1   0   0   0   0   0   0   1   0   0   1   0
0   1   0   0   0   0   0   1   0   0   0   1   1   0   0   0   0   1   0   0   0   0   1   0
0   0   1   0   0   0   0   1   1   0   0   0   0   1   0   0   0   0   1   0   0   0   0   1

is it possible to do this in R?

CodePudding user response：

Here’s a solution using base R that generates a transpose of your solution. That is, it creates a one-hot column vector for each individual character and concatenates these columns (i.e. there are always four rows, regardless of the number of strings).

(For a solution that produces the same format as the question, see the bottom.)

sequences = c('ACTTTA', 'TTGATG', 'GATTACA')

strsplit(sequences, '') |>
  lapply(match, table = c('A', 'C', 'G', 'T')) |>
  lapply(\(x) {
    m = diag(0L, nrow = 4L, ncol = length(x))
    m[cbind(x, seq_along(x))] = 1L
    m
  }) |>
  do.call('cbind', args = _)

Or, alternatively (shorter but not necessarily more readable):

strsplit(sequences, '') |>
  lapply(match, table = c('A', 'C', 'G', 'T')) |>
  unlist() %>%
  {diag(1L, 4L)[, .]}

Unfortunately this requires the ‘magrittr’ pipe but we can change this (and make it more readable again) by abstracting the expression on the last line into a function:

num_to_one_hot = function (x, bits) {
  diag(1L, bits)[, x]
}

This is a nice, general function to one-hot encode a numeric vector. Furthermore, we can unlist after the first step, which allows us to avoid lapply. And with that our DNA encoding code becomes:

strsplit(sequences, '') |>
  unlist() |>
  match(c('A', 'C', 'G', 'T')) |>
  num_to_one_hot(bits = 4L)

… which I find by far the most self-explanatory (and most readable!) of all the alternatives. It’s also fully vectorised, and doesn’t use lapply or similar, so it’s also more efficient.

For completeness, here’s a solution that produces the same output as requested in the question, by transforming the matrix produced by the previous algorithm from col-major to row-major orientation:

strsplit(sequences, '') |>
  unlist() |>
  match(c('A', 'C', 'G', 'T')) |>
  num_to_one_hot(4L) |>
  matrix(nrow = length(sequences), byrow = TRUE)

I benchmarked all currently posted solutions, and the ones using num_to_one_hot (Konrad2 and Konrad3 below) are the fastest:

> bench::mark(Maël(), Paul(), Konrad1(), Konrad2(), Konrad3(), GKi1(), GKi2(), Thomas(), check = FALSE)
# # A tibble: 8 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 Maël()        776µs 818.25µs     1154.    1.36KB    13.3    522     6    452.2ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 Paul()       1.07ms   1.17ms      743.    4.38KB     2.04   365     1    491.2ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 3 Konrad1()    21.7µs   23.7µs    39401.      432B    15.8   9996     4    253.7ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 4 Konrad2()     4.8µs    5.7µs   148619.    1.69KB    14.9   9999     1     67.3ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 5 Konrad3()     6.6µs    7.9µs   108540.    2.11KB    10.9   9999     1     92.1ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 6 GKi1()        9.2µs   10.8µs    83596.      960B    16.7   9998     2    119.6ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 7 GKi2()      258.3µs    278µs     3442.      960B     0     1721     0      500ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 8 Thomas()       10µs   11.6µs    77604.    2.39KB     7.76  9999     1    128.8ms <NULL> <Rprofmem> <bench_tm> <tibble>

CodePudding user response：

In base R, you can do:

dna <- c("A", "C", "G", "T")
dna_seq <- c("ACTTTA", "TTGATG")

one_hot_encode <- function(x){
  spl <- strsplit(x, "")[[1]]
  fa <- factor(spl, levels = dna)
  sapply(fa, table) |>
    Reduce(f = c, x = _)
}

data.frame(do.call(rbind, lapply(dna_seq, one_hot_encode)))

output

  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
1  1  0  0  0  0  1  0  0  0   0   0   1   0   0   0   1   0   0   0   1   1   0   0   0
2  0  0  0  1  0  0  0  1  0   0   1   0   1   0   0   0   0   0   0   1   0   0   1   0

CodePudding user response：

library(stringr)

dataIn <- c(
  "ACTTTA", 
  "TTGATG", 
  "CTTACG", 
  "GTACGT"
  )

one_hot_encode <- function(baseSeq) {
 outSeq <- stringr::str_replace_all(baseSeq, c("A" = "1000",
                                  "C" = "0100",
                                  "G" = "0010",
                                  "T" = "0001"))
 outSeq <- str_extract_all(outSeq, boundary("character"))
 unlist(outSeq)
}

data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))

gives

 > data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))
  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
1  1  0  0  0  0  1  0  0  0   0   0   1   0   0   0   1   0   0   0   1   1   0   0   0
2  0  0  0  1  0  0  0  1  0   0   1   0   1   0   0   0   0   0   0   1   0   0   1   0
3  0  1  0  0  0  0  0  1  0   0   0   1   1   0   0   0   0   1   0   0   0   0   1   0
4  0  0  1  0  0  0  0  1  1   0   0   0   0   1   0   0   0   0   1   0   0   0   0   1

Some row and column names might tidy up the output, but I think this is essentially what you were after?

CodePudding user response：

In base R, we can use match and utf8ToInt to find the mapping position, and matrix to format the output, e.g.,

dna <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
matrix(t(diag(4)[match(unlist(lapply(dna, utf8ToInt)), utf8ToInt("ACGT")), ]), nrow = length(dna), byrow = TRUE)

such that we will obtain

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,]    1    0    0    0    0    1    0    0    0     0     0     1     0     0
[2,]    0    0    0    1    0    0    0    1    0     0     1     0     1     0
[3,]    0    1    0    0    0    0    0    1    0     0     0     1     1     0
[4,]    0    0    1    0    0    0    0    1    1     0     0     0     0     1
     [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
[1,]     0     1     0     0     0     1     1     0     0     0
[2,]     0     0     0     0     0     1     0     0     1     0
[3,]     0     0     0     1     0     0     0     0     1     0
[4,]     0     0     0     0     1     0     0     0     0     1

CodePudding user response：

Another base way.

s <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")

dna <- c("A", "C", "G", "T")
lup <- setNames(asplit(diag(length(dna)), 1), dna)
lapply(strsplit(s, "", TRUE), \(x) unlist(lup[x], FALSE, FALSE))
#[[1]]
# [1] 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
#
#[[2]]
# [1] 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
#
#[[3]]
# [1] 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
#
#[[4]]
# [1] 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

Or using gsub

. <- gsub("A", "1000", s)
. <- gsub("C", "0100", .)
. <- gsub("G", "0010", .)
. <- gsub("T", "0001", .)
cbind(s, .)
#     s        .                         
#[1,] "ACTTTA" "100001000001000100011000"
#[2,] "TTGATG" "000100010010100000010010"
#[3,] "CTTACG" "010000010001100001000010"
#[4,] "GTACGT" "001000011000010000100001"

lapply(strsplit(., "", TRUE), as.integer)
#[[1]]
# [1] 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
#
#[[2]]
# [1] 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
#
#[[3]]
# [1] 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
#
#[[4]]
# [1] 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

lapply(strsplit(s, "", TRUE), \(x) c(diag(length(dna))[,match(x, dna)]))

matrix(diag(4)[,unlist(lapply(strsplit(s, "", TRUE), match, dna), FALSE, FALSE)], length(s), byrow = TRUE)

Benchmark

library(magrittr) #For Konrad
s <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
dna <- c("A", "C", "G", "T")

bench::mark(check=FALSE,
            GKi1 = {lup <- setNames(asplit(diag(length(dna)), 1), dna)
              lapply(strsplit(s, "", TRUE), \(x) unlist(lup[x], FALSE, FALSE))},
GKi2 = {. <- gsub("A", "1000", s, perl=TRUE)
  . <- gsub("C", "0100", ., perl=TRUE)
  . <- gsub("G", "0010", ., perl=TRUE)
  gsub("T", "0001", ., perl=TRUE)},
GKi3 = lapply(strsplit(s, "", TRUE), \(x) c(diag(length(dna))[,match(x, dna)])),
GKi4 = matrix(diag(4)[,unlist(lapply(strsplit(s, "", TRUE), match, dna), FALSE, FALSE)], length(s), byrow = TRUE),
Konrad = {
strsplit(s, '') |>
  lapply(match, table = c('A', 'C', 'G', 'T')) |>
  unlist() %>%
  {replace(diag(0L, 4L, length(.)), cbind(., seq_along(.)), 1L)}
},
Thomas = matrix(t(diag(4)[match(unlist(lapply(s, utf8ToInt)), utf8ToInt("ACGT")), ]), nrow = length(dna), byrow = TRUE)
         )

Result

  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 GKi1         49.4µs 54.1µs    17878.    7.16KB    14.5   8622     7      482ms
2 GKi2         35.4µs 37.1µs    26122.        0B     5.23  9998     2      383ms
3 GKi3         28.1µs   31µs    31574.   16.74KB    15.8   9995     5      317ms
4 GKi4           19µs 21.3µs    46142.    1.59KB    18.5   9996     4      217ms
5 Konrad       25.6µs 28.6µs    34071.      672B    13.6   9996     4      293ms
6 Thomas       19.6µs 21.6µs    45615.    2.39KB    18.3   9996     4      219ms