Faster way to map values than plyr::mapvalues() in R?

I am mapping several columns of data and would like to make it faster. Here is what I am currently doing:

datatable$col1 = mapvalues(datatable$col1, from=datatable[value==1, col5], to=datatable[value==1, col6])
datatable$col2 = mapvalues(datatable$col2, from=datatable[value==2, col5], to=datatable[value==2, col6])
datatable$col3 = mapvalues(datatable$col3, from=datatable[value==3, col5], to=datatable[value==3, col6])
datatable$col4 = mapvalues(datatable$col4, from=datatable[value==4, col5], to=datatable[value==4, col6])

Is there a faster alternative to mapvalues()? Would I be able to parallelize the mapvalues() function, or should I just split the 4 lines across threads?

Any advice would be much appreciated. Thank you.

CodePudding user response:

With data.table, you can assign by reference with :=. This should be faster because the data is modified in place and no new copy of the table is made.
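To illustrate the in-place behaviour, here is a minimal sketch on a toy table (the table `dt` and its columns are made up for this example). It uses data.table's `address()` to check that the table is the same object before and after the `:=` calls.

```r
library(data.table)

# Toy table, invented for illustration only
dt <- data.table(id = 1:5, x = c(10L, 20L, 30L, 40L, 50L))
addr_before <- address(dt)

# Add a new column by reference
dt[, y := x * 2L]

# Update a subset of rows of an existing column by reference
dt[id > 3L, x := 0L]

# The table object itself was not replaced by a copy
same_object <- identical(addr_before, address(dt))
print(same_object)
```

By contrast, base-style assignment such as `datatable$col1 = ...` goes through R's usual replacement-function machinery rather than data.table's by-reference update, which is why the question's pattern can be slower on large tables.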

Combining assignment by reference with an i-join (an update join), you can achieve the same result as mapvalues():


library(data.table)

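set.seed(1234)
n = 1e6
datatable = data.table(col1 = sample(1:100, n, TRUE),
                       value = sample(1:8, n, TRUE),
                       col5 = sample(1:100, n, TRUE),
                       col6 = -sample(1:100, n, TRUE))

# Reference result with plyr::mapvalues()
datatable$col2 <- plyr::mapvalues(datatable$col1, datatable[value==1, col5],
                                  datatable[value==1, col6], warn_missing = FALSE)

# Same mapping as an update join: keep the first col6 for each col5 value,
# then write it into col3 wherever col1 matches col5.
# Note: col3 is a new column, so any col1 value without a match would be NA here,
# whereas mapvalues() leaves unmatched values unchanged.
datatable[datatable[value==1][!duplicated(col5), .(col5, col6)],
          col3 := i.col6, on = .(col1 = col5)]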

all(datatable$col2==datatable$col3)
# TRUE
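The same update join can be applied to the four columns from the question in a loop. The following is a sketch under assumptions, not a benchmarked solution: the sample data is invented, the lookup columns are renamed to `from` and `to` for readability, and `map_ref()` is a hypothetical base-R helper used only to cross-check the result. Because each target column already exists, rows without a match keep their original value, as they do with mapvalues().

```r
library(data.table)

set.seed(1234)
n <- 1e5
# Invented sample data with the column names used in the question
datatable <- data.table(col1 = sample(1:100, n, TRUE),
                        col2 = sample(1:100, n, TRUE),
                        col3 = sample(1:100, n, TRUE),
                        col4 = sample(1:100, n, TRUE),
                        value = sample(1:8, n, TRUE),
                        col5 = sample(1:100, n, TRUE),
                        col6 = -sample(1:100, n, TRUE))
orig <- copy(datatable)  # untouched copy, kept only for the cross-check below

for (k in 1:4) {
  target <- paste0("col", k)
  # Lookup table for this column: first `to` for each distinct `from`
  lookup <- unique(datatable[value == k, .(from = col5, to = col6)], by = "from")
  # Update join: overwrite `target` in place where it matches lookup$from
  datatable[lookup, (target) := i.to, on = setNames("from", target)]
}

# Hypothetical helper: first-match mapping in base R, for verification only
map_ref <- function(x, from, to) {
  idx <- match(x, from)
  hit <- !is.na(idx)
  x[hit] <- to[idx[hit]]
  x
}

# Compare every mapped column against the base-R reference
checks <- vapply(1:4, function(k) {
  col <- paste0("col", k)
  identical(datatable[[col]],
            map_ref(orig[[col]], orig[value == k, col5], orig[value == k, col6]))
}, logical(1))
print(all(checks))
```

Whether this beats four mapvalues() calls on your data is worth measuring, for example with system.time(), before considering explicit parallelization.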