I have a large number of individuals (30,000), each with many genetic positions (in the format CHR, POS, REF, ALT):
require(data.table)
ind1 <- data.table('CHR' = c('chr1','chr2','chr3','chr4'),
                   'POS' = c(10111,343243,89128312,843242),
                   'REF' = c('A','T','C','G'),
                   'ALT' = c('G','A','T','C'))
ind2 <- data.table('CHR' = c('chr1','chr2','chr3','chr4'),
                   'POS' = c(48392423,343243,49347898239,8432423),
                   'REF' = c('C','T','A','G'),
                   'ALT' = c('T','A','G','C'))
I want to create a set of all unique variants present in these individuals. I'm currently attempting this using foreach:
require(foreach)
ind_list <- list(ind1,ind2)
combineFunction <- function(x, ...) {
  unique(mapply(rbind, x, ..., simplify = FALSE))
}
inds <- c('ind1', 'ind2')
out <- foreach(ind = 1:length(inds), .combine = 'combineFunction',
               .inorder = FALSE, .multicombine = TRUE) %do% {
  data <- ind_list[[ind]]
  return(data)
}
However, this doesn't give me the desired output:
CHR POS REF ALT
[1,] "chr1" "10111" "A" "G"
[2,] "chr1" "48392423" "C" "T"
[3,] "FALSE" "0" "FALSE" "FALSE"
[4,] "chr2" "343243" "T" "A"
[5,] "chr3" "89128312" "C" "T"
[6,] "chr3" "49347898239" "A" "G"
[7,] "chr4" "843242" "G" "C"
[8,] "chr4" "8432423" "G" "C"
My desired output is equivalent to unique(rbind(ind1, ind2)), or:
CHR POS REF ALT
1: chr1 10111 A G
2: chr2 343243 T A
3: chr3 89128312 C T
4: chr4 843242 G C
5: chr1 48392423 C T
6: chr3 49347898239 A G
7: chr4 8432423 G C
This needs to be done while the data is being read in, due to memory constraints (30,000 individuals × 200,000 variants per individual, the vast majority of which are not unique). The code also needs to be reasonably efficient, so any pointers there are more than welcome, and if there's an easier bash solution I'm happy to try that too. Thanks in advance for any help or suggestions!
CodePudding user response:
Use do.call in place of mapply. mapply maps over the columns of the data.tables in parallel rather than stacking the tables themselves, and because simplify= is not mapply's SIMPLIFY= argument (arguments after ... must be matched exactly), it gets passed through to rbind, which is where the stray FALSE/0 row comes from. do.call(rbind, list(...)) binds the whole tables instead:
combineFunction <- function(...) {
  unique(do.call(rbind, list(...)))
}
out <- foreach(ind = 1:length(ind_list), .combine = 'combineFunction',
               .inorder = FALSE, .multicombine = TRUE) %do% {
  ind_list[[ind]]
}
out
# CHR POS REF ALT
# <char> <num> <char> <char>
# 1: chr1 10111 A G
# 2: chr2 343243 T A
# 3: chr3 89128312 C T
# 4: chr4 843242 G C
# 5: chr1 48392423 C T
# 6: chr3 49347898239 A G
# 7: chr4 8432423 G C
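Since everything here is already a data.table, a slightly more idiomatic combine is rbindlist(), which stacks a list of same-column tables directly. This is a sketch of an alternative combineFunction, not something the foreach call above depends on:
combineFunction <- function(...) {
  # rbindlist() stacks data.tables that share the same columns;
  # unique() then drops variants present in more than one individual
  unique(data.table::rbindlist(list(...)))
}
rbindlist() also tends to be faster than do.call(rbind, ...) when many tables are combined at once.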
I don't know that foreach is really doing anything for you there, since there is no computation being done on separate nodes. You can get the same effect with
out <- Reduce(combineFunction, ind_list)
though I have not done memory benchmarking to prove my belief.
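For the real problem of 30,000 individuals read from disk, one way to keep memory bounded is to deduplicate as you read instead of holding all of the raw tables at once. A minimal sketch, assuming one delimited file per individual with CHR/POS/REF/ALT columns; the directory, file pattern, and fread() call are placeholders for however your data is actually stored:
library(data.table)
files <- list.files("variants", pattern = "\\.tsv$", full.names = TRUE)  # hypothetical paths
acc <- NULL  # running table of unique variants
for (f in files) {
  ind <- fread(f, select = c("CHR", "POS", "REF", "ALT"))
  # rbindlist() skips the NULL on the first pass; unique() keeps the
  # accumulator at the size of the distinct variant set seen so far
  acc <- unique(rbindlist(list(acc, ind)))
}
acc
The peak footprint is then roughly one individual's table plus the distinct variant set, rather than all 30,000 raw tables.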
BTW, you should almost always use library, not require. The latter never stops execution when the package is not available, so the code that follows runs anyway, which is almost never what is intended. If you want to use require, then you need to check its return value and, if it is false, fail in an informative way ... which is what library does for you. Refs: https://stackoverflow.com/a/51263513.
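If you do stick with require, the check described above can be as simple as:
if (!require(data.table)) {
  stop("the data.table package is required but is not installed")
}
# library(data.table) raises an equivalent error for you automatically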