I have a large dataset with more than 1 billion observations, and I need to perform some string operations on it, which is slow.
My code is as simple as this:

```r
DT[, var := some_function(var2)]
```
If I'm not mistaken, `data.table` uses multithreading when it is called with `by`, and I'm trying to parallelize this operation by exploiting that. To do so, I can create an interim grouper variable, such as

```r
DT[, grouper := .I %/% 100]
```

and do

```r
DT[, var := some_function(var2), by = grouper]
```
I tried some benchmarking with a small sample of data, but surprisingly I did not see a performance improvement. So my questions are:
- Does `data.table` use multithreading when it's used with `by`?
- If so, is there a condition under which multithreading is enabled / disabled?
- Is there a way that the user can "enforce" `data.table` to use multithreading here?
FYI, I see that multithreading is enabled with half of my cores when I load `data.table`, so I guess there is no OpenMP issue here.
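For reference, the thread count can be inspected and changed with `data.table`'s own functions; this is a minimal check (half of the logical cores is the package default):

```r
library(data.table)

# Report how many threads data.table will use, with details
getDTthreads(verbose = TRUE)

# Request all logical cores (0 means "use all"); the default is 50%
setDTthreads(0L)
getDTthreads()
```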
CodePudding user response:
I got answers from the `data.table` developers on the data.table GitHub. Here's a summary:
- Finding the groups of the `by` variable is itself always parallelized, but more importantly:
- If the function in `j` is generic (a user-defined function), then there is no parallelization.
- Operations in `j` are parallelized if the function is GForce-optimized (expressions in `j` which contain only the functions `min`, `max`, `mean`, `median`, `var`, `sd`, `sum`, `prod`, `first`, `last`, `head`, `tail`).
So it is advised to parallelize manually if the function in `j` is generic, but this may not always guarantee a speed gain. Reference
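You can check whether a given `j` expression hits the GForce path by turning on verbose output; a sketch (toy data, hypothetical `f()`):

```r
library(data.table)
DT <- data.table(g = rep(1:3, each = 4L), x = rnorm(12))

# GForce-optimized: the verbose output reports that j was GForce-optimized
DT[, mean(x), by = g, verbose = TRUE]

# A user-defined function in j falls back to the unoptimized path,
# so each group is evaluated in plain R without GForce
f <- function(v) mean(v) + 0
DT[, f(x), by = g, verbose = TRUE]
```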
==Solution==
In my case, I encountered a vector memory exhaustion error when I plainly used

```r
DT[, var := some_function(var2)]
```

even though my server had 1TB of RAM, while the data took 200GB of memory.
I used `split(DT, by = 'grouper')` to split my `data.table` into chunks, and used `doFuture` with `foreach` and `%dopar%` to do the job. It was pretty fast.
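The approach above can be sketched as follows; `some_function()`, `var`, `var2`, and the chunk size are placeholders carried over from the question, and `plan(multisession)` is one possible backend choice:

```r
library(data.table)
library(doFuture)   # loads foreach and future as dependencies

registerDoFuture()
plan(multisession)                    # parallel workers on the local machine

DT[, grouper := .I %/% 100000L]       # chunk id; chunk size is arbitrary
chunks <- split(DT, by = "grouper")

result <- foreach(ch = chunks) %dopar% {
  ch[, var := some_function(var2)]    # each worker handles one chunk
  ch
}
result <- rbindlist(result)           # reassemble the chunks
```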