I am mapping several columns of data and would like to make it faster. Here is what I am currently doing:
datatable$col1 = mapvalues(datatable$col1, from=datatable[value==1, col5], to=datatable[value==1, col6])
datatable$col2 = mapvalues(datatable$col2, from=datatable[value==2, col5], to=datatable[value==2, col6])
datatable$col3 = mapvalues(datatable$col3, from=datatable[value==3, col5], to=datatable[value==3, col6])
datatable$col4 = mapvalues(datatable$col4, from=datatable[value==4, col5], to=datatable[value==4, col6])
Is there a faster alternative to mapvalues? Could I parallelize the mapvalues() calls, or simply split the four lines across threads?
Any advice would be much appreciated. Thank you.
CodePudding user response:
With data.table, you can assign by reference with :=. This should be faster because the data is modified in place and no new copy of the data is made.
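As a small illustration of what "by reference" means here, the sketch below adds a column with := and compares the table's memory address before and after. The table `dt` and its columns `a` and `b` are made up for the example; `address()` is the helper exported by data.table.

```r
library(data.table)

# Hypothetical toy table, used only to illustrate in-place assignment.
dt <- data.table(a = 1:5)

# Memory address of the table before the update.
before <- address(dt)

# Add column b by reference; the existing table is updated in place.
dt[, b := a * 2L]

# The address is unchanged, so no new copy of the table was created.
identical(before, address(dt))
```

The same check after `dt$b <- ...` would typically report a different address, because that form assigns to a copy of the table.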
Combining assignment by reference with a join on i, you can achieve the same result as mapvalues:
library(data.table)
set.seed(1234)
n = 1e6
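datatable = data.table(col1 = sample(1:100, n, TRUE),
                       value = sample(1:8, n, TRUE),
                       col5 = sample(1:100, n, TRUE),
                       col6 = -sample(1:100, n, TRUE))
# Reference result: plyr::mapvalues, using the rows where value == 1 as the lookup
datatable$col2 <- plyr::mapvalues(datatable$col1, datatable[value==1, col5],
                                  datatable[value==1, col6], warn_missing = FALSE)
# Same mapping as an update join: keep the first row per col5 so each key maps to one value,
# then look up col1 in col5 and write the matching col6 into col3 by reference
datatable[datatable[value==1][!duplicated(col5), .(col1,col5,col6)],
          col3 := i.col6, on=.(col1=col5)]
# Both approaches agree
all(datatable$col2==datatable$col3)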
# TRUE
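To tie this back to the four lines in the question, here is a minimal sketch that applies the same update join to col1 through col4 in a loop, overwriting each column in place. The toy data, the helper names `lookups`, `target`, and `col1_before`, and the pairing of col k with value == k follow the question's pattern but are otherwise assumptions. Rows without a match keep their original value, because an update join only touches matched rows.

```r
library(data.table)

set.seed(1234)
n <- 1e5

# Toy data shaped like the question: four columns to remap, plus the lookup
# pairs stored in col5 (from) and col6 (to), grouped by value.
# Sizes and value ranges are assumptions for the example.
datatable <- data.table(
  col1  = sample(1:100, n, TRUE),
  col2  = sample(1:100, n, TRUE),
  col3  = sample(1:100, n, TRUE),
  col4  = sample(1:100, n, TRUE),
  value = sample(1:4, n, TRUE),
  col5  = sample(1:100, n, TRUE),
  col6  = -sample(1:100, n, TRUE)
)

# Independent copy of col1, kept only to verify the result afterwards.
col1_before <- copy(datatable$col1)

# One lookup table per value group, keeping the first row per distinct col5.
lookups <- lapply(1:4, function(k) {
  unique(datatable[value == k, .(col5, col6)], by = "col5")
})

# Update join for each column: match col<k> against col5 of its lookup and
# overwrite the matched rows with col6, by reference.
for (k in 1:4) {
  target <- paste0("col", k)
  datatable[lookups[[k]], (target) := i.col6, on = setNames("col5", target)]
}

# Check col1 against a base R lookup built with match().
m <- match(col1_before, lookups[[1]]$col5)
expected <- ifelse(is.na(m), col1_before, lookups[[1]]$col6[m])
identical(datatable$col1, expected)
```

The loop runs sequentially. Whether spreading the four updates across threads would help is something to measure on the real data rather than assume.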