data.table efficiently sum over previous rows

Time:07-29

I have a dataframe

  stim1 stim2     stim_time feedback_time feedback   RT
1     2     3 1607325455244 1607325458496        1 3252
2     1     3 1607325462034 1607325464723        1 2689
3     2     1 1607325474049 1607325476997        1 2948
4     1     2 1607325480994 1607325483135        1 2141
5     3     1 1607325492735 1607325495914        1 3179
6     3     2 1607325498941 1607325501287        1 2346

My objective is to add, in an efficient manner, columns that hold the cumulative RT (column 6) previously spent on whatever stimulus appears in stim1 and stim2.

For example

In row 4, stim1 = 1. Stimulus 1 was previously shown in trial 2 and trial 3,

so left_time_spent(4) = 2689 (RT(2)) + 2948 (RT(3)) = 5637.

In row 4, stim2 = 2. Stimulus 2 was previously shown in trial 1 and trial 3, with RT 3252 + 2948 = 6200,
so right_time_spent(4) = 6200.
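
The rule in this example can be written down as a plain base-R loop. This is only a slow reference sketch for checking results, not an efficient solution; the name `time_spent_naive` is hypothetical, and it assumes a stimulus never appears on both sides of the same trial.

```r
# Reference (quadratic-time) implementation of the rule described above.
# `time_spent_naive` is a hypothetical helper name, not part of any package.
time_spent_naive <- function(stim1, stim2, RT) {
  n <- length(RT)
  left <- numeric(n)
  right <- numeric(n)
  for (i in seq_len(n)) {
    prev <- seq_len(i - 1L)  # indices of all earlier trials (empty for i = 1)
    # an earlier trial counts if the stimulus appeared on either side
    shown_left  <- stim1[prev] == stim1[i] | stim2[prev] == stim1[i]
    shown_right <- stim1[prev] == stim2[i] | stim2[prev] == stim2[i]
    left[i]  <- sum(RT[prev][shown_left])
    right[i] <- sum(RT[prev][shown_right])
  }
  list(left_time_spent = left, right_time_spent = right)
}

# First four trials from the question
res <- time_spent_naive(
  stim1 = c(2L, 1L, 2L, 1L),
  stim2 = c(3L, 3L, 1L, 2L),
  RT    = c(3252, 2689, 2948, 2141)
)
res$left_time_spent[4]   # 5637
res$right_time_spent[4]  # 6200
```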

desired output

   stim1 stim2     stim_time feedback_time feedback   RT left_time_spent right_time_spent
1:     2     3 1607325455244 1607325458496        1 3252               0                0
2:     1     3 1607325462034 1607325464723        1 2689               0             3252
3:     2     1 1607325474049 1607325476997        1 2948            3252             2689
4:     1     2 1607325480994 1607325483135        1 2141            5637             6200

sharing shortened dput object as well. Thank you so much for all your help!

    structure(list(stim1 = c(2L, 1L, 2L, 1L), stim2 = c(3L, 3L, 1L, 
2L), stim_time = structure(c(7.94124289122202e-312, 7.94124292476908e-312, 
7.94124298413107e-312, 7.94124301844393e-312), class = "integer64"), 
    feedback_time = structure(c(7.94124290728904e-312, 7.94124293805451e-312, 
    7.94124299869612e-312, 7.94124302902187e-312), class = "integer64"), 
    feedback = c(1L, 1L, 1L, 1L), RT = structure(c(1.60670148027573e-320, 
    1.32854252166711e-320, 1.45650552393999e-320, 1.05779454774611e-320
    ), class = "integer64"), left_time_spent = c(0, 0, 3252, 
    5637), right_time_spent = c(0, 3252, 2689, 6200)), class = c("data.table", 
"data.frame"), row.names = c(NA, -4L), .internal.selfref = <pointer: 0x7f8c2600e4e0>)

CodePudding user response:

An Rcpp solution:

dualcumsum <- Rcpp::cppFunction('
  List dualcumsum(const IntegerVector& x, const IntegerVector& y, const IntegerVector& val, const int& numxy) {
    const int n = x.size();
    // csum(k) holds the total of val seen so far for stimulus k (ids 0..numxy)
    IntegerVector csum(numxy + 1);
    IntegerVector out1(n);
    IntegerVector out2(n);
    
    for (int i = 0; i < n; i++) {
      // record the running totals before adding the current trial
      out1(i) = csum(x(i));
      out2(i) = csum(y(i));
      csum(x(i)) += val(i);
      csum(y(i)) += val(i);
    }
    
    return(List::create(out1, out2));
  }
')
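
For readers without a C++ toolchain, the same single-pass idea can be sketched in plain R: keep one running total per stimulus, read it before updating it. The name `dualcumsum_r` is hypothetical, the sketch assumes stimulus ids are integers in 1..numxy, and it will likely be much slower than the compiled version on large tables.

```r
# Pure-R sketch of the single-pass running-total approach.
# `dualcumsum_r` is a hypothetical name; ids are assumed to lie in 1..numxy.
dualcumsum_r <- function(x, y, val, numxy) {
  csum <- numeric(numxy)  # running total of val per stimulus id
  n <- length(val)
  out1 <- numeric(n)
  out2 <- numeric(n)
  for (i in seq_len(n)) {
    # read the totals accumulated over earlier rows only
    out1[i] <- csum[x[i]]
    out2[i] <- csum[y[i]]
    # then add the current row to both stimuli shown in it
    csum[x[i]] <- csum[x[i]] + val[i]
    csum[y[i]] <- csum[y[i]] + val[i]
  }
  list(out1, out2)
}

out <- dualcumsum_r(c(2L, 1L, 2L, 1L), c(3L, 3L, 1L, 2L),
                    c(3252, 2689, 2948, 2141), 3L)
```

Each row is visited once, so the work grows linearly with the number of rows instead of rescanning all previous rows for every trial.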

Example use on a simplified table:

library(data.table)

dt <- data.table(stim1 = c(2L, 1L, 2L, 1L),
                stim2 = c(3L, 3L, 1L, 2L),
                RT = c(3252L, 2689L, 2948L, 2141L))
dt[, c("left_time_spent", "right_time_spent") := dualcumsum(stim1, stim2, as.integer(RT), max(c(stim1, stim2)))][]
   stim1 stim2   RT left_time_spent right_time_spent
1:     2     3 3252               0                0
2:     1     3 2689               0             3252
3:     2     1 2948            3252             2689
4:     1     2 2141            5637             6200

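If stim1 and stim2 aren't small 0- or 1-based integers (for example, arbitrary ids or character labels), map them to 1-based indices first. After match(), the indices run from 1 to length(stims), so that is the size to pass:

stims <- with(dt, unique(c(stim1, stim2)))
dt[, c("left_time_spent", "right_time_spent") := dualcumsum(match(stim1, stims), match(stim2, stims), as.integer(RT), length(stims))]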
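
To see what the `match()` step produces, here is a small sketch with made-up character labels (the vectors below are illustrative, not from the question's data). Each label is replaced by its position in `stims`, giving indices from 1 to `length(stims)`.

```r
# Hypothetical character labels standing in for stim1 / stim2
stim1 <- c("b", "a", "b", "a")
stim2 <- c("c", "c", "a", "b")

stims <- unique(c(stim1, stim2))  # "b" "a" "c", in order of first appearance
idx1 <- match(stim1, stims)       # 1 2 1 2
idx2 <- match(stim2, stims)       # 3 3 2 1
n_ids <- length(stims)            # 3 distinct stimuli
```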