Use several columns values as input to a function to compute new column in data.table

Time:12-13

I am using data.table to process a dataset with several columns. I need to use a few of those column values to compute a new column, specific to each row. I know I can use the .SDcols functionality for some simple functions. However, it's trickier when I want to use my own function, which treats the column values differently. Here is my example:

This is what the data.table looks like:

  Training Age  Client1 Stim1.0 Stim1.1  Client2 Stim2.0 Stim2.1 Choice Val.00 Val.01
1:        0   1  absence       0       0  absence       0       0      2      0      0
2:        0   2  absence       0       0  absence       0       0      2      0      0
3:        0   3 Object 1       1       1 Object 2       2       2      1      0      0
4:        0   4 Object 2       2       1 Object 2       2       1      1      0      0
5:        0   5  absence       0       0  absence       0       0      2      0      0
6:        0   6  absence       0       0  absence       0       0      2      0      0
   Val.02 alpha.0 Val.10 Val.11 Val.12 alpha.1 V25
1: 0.0000   0.005      0 0.0000 0.0000   0.005  NA
2: 0.0000   0.005      0 0.0000 0.0000   0.005  NA
3: 0.0025   0.005      0 0.0000 0.0025   0.005  NA
4: 0.0050   0.005      0 0.0025 0.0025   0.005  NA
5: 0.0050   0.005      0 0.0025 0.0025   0.005  NA
6: 0.0050   0.005      0 0.0025 0.0025   0.005  NA

The function uses the value of the columns starting with Stim to choose which of the columns starting with Val must be included in the computation of the new value.

When the number of Stim and Val columns is low, say 2 and 3 respectively, I can solve it using fcase:

rawData[,`:=`(Val.Client1=fcase(Stim1.0==0,Val.00,
                               Stim1.0==1,Val.01,Stim1.0==2,Val.02) +
                fcase(Stim1.1==0,Val.10,
                      Stim1.1==1,Val.11,Stim1.1==2,Val.12),
              Val.Client2=fcase(Stim2.0==0,Val.00,
                               Stim2.0==1,Val.01,Stim2.0==2,Val.02) +
                fcase(Stim2.1==0,Val.10,
                      Stim2.1==1,Val.11,Stim2.1==2,Val.12))]

However, the number of columns varies across the data sets I am using. So, I'd like to code it independently of the number of columns.

I have managed to make it work using a combination of .SDcols and apply in this way:

numSti<-2; numFeat<-2 # parameters to know the number of columns to expect
rawData[,Val.Client1:=apply(.SD,MARGIN = 1,FUN = function(x){
# I use apply to get a vector with all the relevant values
  x<-as.numeric(x) # for some reason I must force it to be numeric 
  Stim1.tmp<-x[1:numSti]+1 # Choose the relevant values for the Stim columns (shifted to 1-based)
  vals<-x[(numSti*2+1):(numSti*2+numSti*(1+numFeat))] # choose the relevant values for the Val columns
  locVal<-Stim1.tmp+(numFeat+1)*(0:(numSti-1)) # map the Stim to the Val columns
  return(sum(vals[locVal])) # sum over the chosen values. 
}),.SDcols=patterns("Stim.|Val.")]

This piece of code gives me the correct computation. But it's terribly slow! Could you help me find a speedier solution?

By request of @jblood94: The output of dput(rawData)

as.data.table(structure(list(Age = 1:6, Client1 = c(2L, 2L, 0L, 1L, 2L, 2L), 
    Stim1.0 = c(0L, 0L, 1L, 2L, 0L, 0L), Stim1.1 = c(0L, 0L, 
    1L, 1L, 0L, 0L), Client2 = c(2L, 2L, 1L, 1L, 2L, 2L), Stim2.0 = c(0L, 
    0L, 2L, 2L, 0L, 0L), Stim2.1 = c(0L, 0L, 2L, 1L, 0L, 0L), 
    Choice = c(2L, 2L, 1L, 1L, 2L, 2L), Val.00 = c(0, 0, 0, 0, 
    0, 0), Val.01 = c(0, 0, 0, 0, 0, 0), Val.02 = c(0, 0, 0.0025, 
    0.005, 0.005, 0.005), alpha.0 = c(0.005, 0.005, 0.005, 0.005, 
    0.005, 0.005), Val.10 = c(0, 0, 0, 0, 0, 0), Val.11 = c(0, 
    0, 0, 0.0025, 0.0025, 0.0025), Val.12 = c(0, 0, 0.0025, 0.0025, 
    0.0025, 0.0025), alpha.1 = c(0.005, 0.005, 0.005, 0.005, 
    0.005, 0.005), V25 = c(NA, NA, NA, NA, NA, NA)), row.names = c(NA, 
-6L), class = c("data.table", "data.frame")))

CodePudding user response:

This seems to be working. The idea is to index the Val columns based on the Stim values, put the result into an array of dimensions c(nrow(rawData), numSti, 2), then permute the array in order to sum along the second dimension with colSums.

numSti <- 2
numFeat <- 2

rawData[
  ,paste0("Val.Client", 1:2) := as.data.table(
    colSums(
      aperm(
        array(
          unlist(rawData[,.SD , .SDcols = grep("Val.", names(rawData), value = TRUE)])[
            1:.N + (unlist(rawData[,.SD , .SDcols = grep("Stim", names(rawData), value = TRUE)]) + rep(c(0, numFeat + 1), each = .N))*.N
          ],
          c(.N, numSti, 2)
        ),
        c(2, 1, 3)
      )
    )
  )
]

rawData[, Val.Client1:Val.Client2]
#>    Val.Client1 Val.Client2
#> 1:      0.0000      0.0000
#> 2:      0.0000      0.0000
#> 3:      0.0000      0.0050
#> 4:      0.0075      0.0075
#> 5:      0.0000      0.0000
#> 6:      0.0000      0.0000
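
To see why the index arithmetic above works, here is a minimal base R sketch on toy data. The values in `vals` and `stim` are made up for illustration; only the column naming follows the question. The Val columns are flattened column by column, so the cell in row i and zero-based column j sits at position i + j * n, and each Stim column gets an offset that moves it to its own block of Val columns.

```r
# Toy data (hypothetical values): 3 rows, 2 Stim columns, levels 0..2
n <- 3
numFeat <- 2

# Val columns in the order Val.00, Val.01, Val.02, Val.10, Val.11, Val.12
vals <- matrix(1:18, nrow = n,
               dimnames = list(NULL, c("Val.00", "Val.01", "Val.02",
                                       "Val.10", "Val.11", "Val.12")))

# Stim columns for a single client
stim <- cbind(Stim1.0 = c(0, 1, 2), Stim1.1 = c(2, 0, 1))

# Offset 0 for Stim1.0 (block Val.0x), numFeat + 1 for Stim1.1 (block Val.1x)
offset <- rep(c(0, numFeat + 1), each = n)

# Column-major linear index: row + (zero-based column) * number of rows
idx <- 1:n + (as.vector(stim) + offset) * n

# One picked value per row and Stim column, then sum across Stim columns
picked <- matrix(as.vector(vals)[idx], nrow = n)
result <- rowSums(picked)

# Row-wise reference computation for comparison
reference <- sapply(1:n, function(i) {
  vals[i, stim[i, 1] + 1] + vals[i, stim[i, 2] + 1 + (numFeat + 1)]
})
stopifnot(isTRUE(all.equal(as.numeric(result), as.numeric(reference))))
```

With a single client, `rowSums` is enough. The `array`/`aperm`/`colSums` step in the answer does the same summation for several clients at once.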

It can also be generalized for the number of clients:

numCli <- 2

rawData[
  ,paste0("Val.Client", 1:numCli) := as.data.table(
    colSums(
      aperm(
        array(
          unlist(rawData[,.SD , .SDcols = grep("Val.", names(rawData), value = TRUE)])[
            1:.N + (
              unlist(
                rawData[,.SD , .SDcols = grep("Stim", names(rawData), value = TRUE)]
              ) + rep(
                # one offset per Stim column within a client (numSti), recycled across clients
                seq(0, by = numFeat + 1, length.out = numSti),
                each = .N
              )
            )*.N
          ],
          c(.N, numSti, numCli)
        ),
        c(2, 1, 3)
      )
    )
  )
]

CodePudding user response:

Perhaps this user function will help out:

fun <- function(data, vals) {
  stimvals <- Map(function(V, levels) {
    match(paste0(sub("Stim[0-9]+\\.([0-9]+)", "Val.\\1", V), levels),
          names(data))
  }, setNames(nm = vals), lapply(vals, function(z) data[[z]]))
  Reduce(`+`, lapply(stimvals, function(z) as.data.frame(data)[cbind(seq_along(z), z)]))
}

stims <- grep("Stim.*", names(rawData), value = TRUE)
stims <- split(stims, sub("\\..*", "", stims))
names(stims) <- sub(".*([0-9]+)$", "Val.Client\\1", names(stims))
stims
# $Val.Client1
# [1] "Stim1.0" "Stim1.1"
# $Val.Client2
# [1] "Stim2.0" "Stim2.1"

rawData[, names(stims) := lapply(stims, fun, data = .SD)]
rawData
#      Age Client1 Stim1.0 Stim1.1 Client2 Stim2.0 Stim2.1 Choice Val.00 Val.01 Val.02 alpha.0 Val.10 Val.11 Val.12 alpha.1    V25 Val.Client1 Val.Client2
#    <int>   <int>   <int>   <int>   <int>   <int>   <int>  <int>  <num>  <num>  <num>   <num>  <num>  <num>  <num>   <num> <lgcl>       <num>       <num>
# 1:     1       2       0       0       2       0       0      2      0      0 0.0000   0.005      0 0.0000 0.0000   0.005     NA      0.0000      0.0000
# 2:     2       2       0       0       2       0       0      2      0      0 0.0000   0.005      0 0.0000 0.0000   0.005     NA      0.0000      0.0000
# 3:     3       0       1       1       1       2       2      1      0      0 0.0025   0.005      0 0.0000 0.0025   0.005     NA      0.0000      0.0050
# 4:     4       1       2       1       1       2       1      1      0      0 0.0050   0.005      0 0.0025 0.0025   0.005     NA      0.0075      0.0075
# 5:     5       2       0       0       2       0       0      2      0      0 0.0050   0.005      0 0.0025 0.0025   0.005     NA      0.0000      0.0000
# 6:     6       2       0       0       2       0       0      2      0      0 0.0050   0.005      0 0.0025 0.0025   0.005     NA      0.0000      0.0000

This is dynamically finding Val* names based on the Stim* sublevel (e.g., .1 of Stim2.1) and the values in the corresponding Stim2.1. That is, if Stim2.1 has a value of 0, then it should pull from Val.10. The data is indexed based on the appropriate Val* columns and then assigned back into a name derived from the first number of the Stim* name (2 of Stim2.1).

So the generation of the stims variable above is the key: group the corresponding Stim#.#* variables together (they will be subsetted/summed) and name them appropriately.
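
The per-row lookup described above can be sketched on its own in base R. The data frame `df` and its values are hypothetical; only the Stim*/Val.* naming follows the question. The sketch builds the target column name for each row, turns it into a column position with `match`, and then uses a two-column (row, column) matrix to pull one cell per row. It goes through `as.matrix` explicitly, which assumes all columns are numeric.

```r
# Toy data (hypothetical values), named after the pattern in the question
df <- data.frame(
  Stim1.0 = c(0L, 2L, 1L),
  Val.00  = c(10, 20, 30),
  Val.01  = c(11, 21, 31),
  Val.02  = c(12, 22, 32)
)

# Target column name per row: "Val.0" followed by the value in Stim1.0
target <- paste0("Val.0", df$Stim1.0)

# Position of each target column in the data frame
col_idx <- match(target, names(df))

# A (row, column) index matrix selects exactly one cell per row
picked <- as.matrix(df)[cbind(seq_len(nrow(df)), col_idx)]
picked
```

Summing such `picked` vectors over the Stim columns of one client gives that client's new column, which is the role `Reduce` plays in `fun`.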
