I want to generate one hot coding matrix for a list of DNA sequences. I have tried to solve my problem from the following link How to generate one hot encoding for DNA sequences? but some of the solutions are given only for one single DNA sequence and not for a list of DNA sequences.
For example
def one_hot_encode(seq):
mapping = dict(zip("ACGT", range(4)))
seq2 = [mapping[i] for i in seq]
return np.eye(4)[seq2]
one_hot_encode("AACGT")
In the given above code, if I run one_hot_encode("AACGT","GGTAC","CGTAC")
it will fail, also i want to generate matrix as output.
Currently, I am working in R and below is my DNA sequence in the r data frame(single-column file)
ACTTTA
TTGATG
CTTACG
GTACGT
Expected output
1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
is it possible to do this in R?
CodePudding user response:
Here’s a solution using base R that generates a transpose of your solution. That is, it creates a one-hot column vector for each individual character and concatenates these columns (i.e. there are always four rows, regardless of the number of strings).
(For a solution that produces the same format as the question, see the bottom.)
sequences = c('ACTTTA', 'TTGATG', 'GATTACA')
strsplit(sequences, '') |>
lapply(match, table = c('A', 'C', 'G', 'T')) |>
lapply(\(x) {
m = diag(0L, nrow = 4L, ncol = length(x))
m[cbind(x, seq_along(x))] = 1L
m
}) |>
do.call('cbind', args = _)
Or, alternatively (shorter but not necessarily more readable):
strsplit(sequences, '') |>
lapply(match, table = c('A', 'C', 'G', 'T')) |>
unlist() %>%
{diag(1L, 4L)[, .]}
Unfortunately this requires the ‘magrittr’ pipe but we can change this (and make it more readable again) by abstracting the expression on the last line into a function:
num_to_one_hot = function (x, bits) {
diag(1L, bits)[, x]
}
This is a nice, general function to one-hot encode a numeric vector. Furthermore, we can unlist
after the first step, which allows us to avoid lapply
. And with that our DNA encoding code becomes:
strsplit(sequences, '') |>
unlist() |>
match(c('A', 'C', 'G', 'T')) |>
num_to_one_hot(bits = 4L)
… which I find by far the most self-explanatory (and most readable!) of all the alternatives. It’s also fully vectorised, and doesn’t use lapply
or similar, so it’s also more efficient.
For completeness, here’s a solution that produces the same output as requested in the question, by transforming the matrix produced by the previous algorithm from col-major to row-major orientation:
strsplit(sequences, '') |>
unlist() |>
match(c('A', 'C', 'G', 'T')) |>
num_to_one_hot(4L) |>
matrix(nrow = length(sequences), byrow = TRUE)
I benchmarked all currently posted solutions, and the ones using num_to_one_hot
(Konrad2
and Konrad3
below) are the fastest:
> bench::mark(Maël(), Paul(), Konrad1(), Konrad2(), Konrad3(), GKi1(), GKi2(), Thomas(), check = FALSE)
# # A tibble: 8 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 Maël() 776µs 818.25µs 1154. 1.36KB 13.3 522 6 452.2ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 Paul() 1.07ms 1.17ms 743. 4.38KB 2.04 365 1 491.2ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 3 Konrad1() 21.7µs 23.7µs 39401. 432B 15.8 9996 4 253.7ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 4 Konrad2() 4.8µs 5.7µs 148619. 1.69KB 14.9 9999 1 67.3ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 5 Konrad3() 6.6µs 7.9µs 108540. 2.11KB 10.9 9999 1 92.1ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 6 GKi1() 9.2µs 10.8µs 83596. 960B 16.7 9998 2 119.6ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 7 GKi2() 258.3µs 278µs 3442. 960B 0 1721 0 500ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 8 Thomas() 10µs 11.6µs 77604. 2.39KB 7.76 9999 1 128.8ms <NULL> <Rprofmem> <bench_tm> <tibble>
CodePudding user response:
In base R, you can do:
dna <- c("A", "C", "G", "T")
dna_seq <- c("ACTTTA", "TTGATG")
one_hot_encode <- function(x){
spl <- strsplit(x, "")[[1]]
fa <- factor(spl, levels = dna)
sapply(fa, table) |>
Reduce(f = c, x = _)
}
data.frame(do.call(rbind, lapply(dna_seq, one_hot_encode)))
output
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
2 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
CodePudding user response:
library(stringr)
dataIn <- c(
"ACTTTA",
"TTGATG",
"CTTACG",
"GTACGT"
)
one_hot_encode <- function(baseSeq) {
outSeq <- stringr::str_replace_all(baseSeq, c("A" = "1000",
"C" = "0100",
"G" = "0010",
"T" = "0001"))
outSeq <- str_extract_all(outSeq, boundary("character"))
unlist(outSeq)
}
data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))
gives
> data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
2 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
3 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
4 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
Some row and column names might tidy up the output, but I think this is essentially what you were after?
CodePudding user response:
In base R, we can use match
and utf8ToInt
to find the mapping position, and matrix
to format the output, e.g.,
dna <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
matrix(t(diag(4)[match(unlist(lapply(dna, utf8ToInt)), utf8ToInt("ACGT")), ]), nrow = length(dna), byrow = TRUE)
such that we will obtain
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 1 0 0 0 0 1 0 0 0 0 0 1 0 0
[2,] 0 0 0 1 0 0 0 1 0 0 1 0 1 0
[3,] 0 1 0 0 0 0 0 1 0 0 0 1 1 0
[4,] 0 0 1 0 0 0 0 1 1 0 0 0 0 1
[,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
[1,] 0 1 0 0 0 1 1 0 0 0
[2,] 0 0 0 0 0 1 0 0 1 0
[3,] 0 0 0 1 0 0 0 0 1 0
[4,] 0 0 0 0 1 0 0 0 0 1
CodePudding user response:
Another base way.
s <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
dna <- c("A", "C", "G", "T")
lup <- setNames(asplit(diag(length(dna)), 1), dna)
lapply(strsplit(s, "", TRUE), \(x) unlist(lup[x], FALSE, FALSE))
#[[1]]
# [1] 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
#
#[[2]]
# [1] 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
#
#[[3]]
# [1] 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
#
#[[4]]
# [1] 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
Or using gsub
. <- gsub("A", "1000", s)
. <- gsub("C", "0100", .)
. <- gsub("G", "0010", .)
. <- gsub("T", "0001", .)
cbind(s, .)
# s .
#[1,] "ACTTTA" "100001000001000100011000"
#[2,] "TTGATG" "000100010010100000010010"
#[3,] "CTTACG" "010000010001100001000010"
#[4,] "GTACGT" "001000011000010000100001"
lapply(strsplit(., "", TRUE), as.integer)
#[[1]]
# [1] 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
#
#[[2]]
# [1] 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
#
#[[3]]
# [1] 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
#
#[[4]]
# [1] 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
Or
lapply(strsplit(s, "", TRUE), \(x) c(diag(length(dna))[,match(x, dna)]))
Or
matrix(diag(4)[,unlist(lapply(strsplit(s, "", TRUE), match, dna), FALSE, FALSE)], length(s), byrow = TRUE)
Benchmark
library(magrittr) #For Konrad
s <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
dna <- c("A", "C", "G", "T")
bench::mark(check=FALSE,
GKi1 = {lup <- setNames(asplit(diag(length(dna)), 1), dna)
lapply(strsplit(s, "", TRUE), \(x) unlist(lup[x], FALSE, FALSE))},
GKi2 = {. <- gsub("A", "1000", s, perl=TRUE)
. <- gsub("C", "0100", ., perl=TRUE)
. <- gsub("G", "0010", ., perl=TRUE)
gsub("T", "0001", ., perl=TRUE)},
GKi3 = lapply(strsplit(s, "", TRUE), \(x) c(diag(length(dna))[,match(x, dna)])),
GKi4 = matrix(diag(4)[,unlist(lapply(strsplit(s, "", TRUE), match, dna), FALSE, FALSE)], length(s), byrow = TRUE),
Konrad = {
strsplit(s, '') |>
lapply(match, table = c('A', 'C', 'G', 'T')) |>
unlist() %>%
{replace(diag(0L, 4L, length(.)), cbind(., seq_along(.)), 1L)}
},
Thomas = matrix(t(diag(4)[match(unlist(lapply(s, utf8ToInt)), utf8ToInt("ACGT")), ]), nrow = length(dna), byrow = TRUE)
)
Result
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 GKi1 49.4µs 54.1µs 17878. 7.16KB 14.5 8622 7 482ms
2 GKi2 35.4µs 37.1µs 26122. 0B 5.23 9998 2 383ms
3 GKi3 28.1µs 31µs 31574. 16.74KB 15.8 9995 5 317ms
4 GKi4 19µs 21.3µs 46142. 1.59KB 18.5 9996 4 217ms
5 Konrad 25.6µs 28.6µs 34071. 672B 13.6 9996 4 293ms
6 Thomas 19.6µs 21.6µs 45615. 2.39KB 18.3 9996 4 219ms