Add column that sums all previous rows that meet condition-CodePudding

say I have a large table of the following columns

   subject stim1 stim2 Chosen
1:       1     2     1      2
2:       1     3     2      2
3:       1     3     1      1
4:       1     2     3      3
5:       1     1     3      1

I'm looking for an efficient way (since the full data set is large) to mutate two additional columns (by subject)

stim1_seen, stim2_seen = is the sum of all prior instances in which the current stim1 was previously either in stim1 or stim2 (stim1_seen) or stim2 was previously in stim1 or stim2 (stim2_seen).
stim1_chosen, stim2_chosen= is the sum of all prior instances in which the current stim1 was chosen and the current stim2 was chosen respectively.

Desired output

     subject stim1 stim2 Chosen  stim1_chosen   stim2_chosen
1:       1     2     1      2         0               0
2:       1     3     2      2         0               1
3:       1     3     1      1         0               0
4:       1     2     3      3         2               0
5:       1     1     3      1         1               1
6:       1     2     1      1         2               2

ideally it'd be using data.table or dplyr.

here is the dput

structure(list(subject = c(1021, 1021, 1021, 1021, 1021, 1021
), stim1 = c(51L, 48L, 49L, 48L, 49L, 46L), stim2 = c(50L, 50L, 
47L, 46L, 51L, 47L), Chosen = c(50L, 50L, 49L, 48L, 49L, 46L)), row.names = c(NA, 
-6L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7fc9ce8158e0>)

CodePudding user response：

Ok, this works on the sample data. It would be good to run it on some where we have more subjects and have values greater than 1 in the columns. I've assumed its a data.table object called dt

1. Index

Really easy to change row ordering with merge operations, so never rely on row numbers, but instead create a rowid by subject. .N is data.table syntax for the length/number of rows.

# order matters, so make a rowid
dt[, rowid := 1:.N, by=subject]

# sets orders and indexing to make it quicker
setkey(dt, subject, rowid)

2. Seen cols

Need to merge the stim1 and stim2 in to one column. Do so by going from a wide to long format with melt. seen:=0:(.N-1) is then grouped by these values to find the cumulative occurrences by row. But as we're looking at prior values, we subtract 1.

Then we do two merges as we're interested in comparing this with both stim cols

# for seen, melt wide to long
dt_seen <- melt(dt, 
                id.vars = c("subject", "rowid"), 
                measure.vars = c("stim1", "stim2"))
# interested in finding occurences
dt_seen <- unique(dt_seen[, .(subject, rowid, value)])
setorder(dt_seen, rowid)
dt_seen[, seen:=0:(.N-1), by=.(subject, value)]

# merge across twice
dt <- merge(dt, dt_seen, 
            by.x=c("subject", "rowid", "stim1"), 
            by.y=c("subject", "rowid", "value"), 
            all.x=TRUE, sort=FALSE)
setnames(dt, "seen", "stim1_seen")
dt <- merge(dt, dt_seen, 
            by.x=c("subject", "rowid", "stim2"), 
            by.y=c("subject", "rowid", "value"), 
            all.x=TRUE, sort=FALSE)
setnames(dt, "seen", "stim2_seen")
dt[]

3. Chosen

I've been lazy and done effectively the same as in section (2), but first filtering to rows where Chosen matches the stim value. And doing one by one instead of together as these cols are independent. The process is identical for stim1 and stim2, so could tidy it up slightly.

# turn Chosen from wide to long
dt_chosen <- melt(dt,
                  id.vars = c("subject", "rowid"), 
                  measure.vars = c("Chosen"))
# interested in finding occurences
# need to expand 
dt_chosen[, variable := NULL]
# going to expand the grid, so can look at e.g. value 50 for all rowids
library(tidyr)
dt_chosen[, chosen_row := 1]
dt_chosen_full <- expand(dt_chosen, nesting(subject, rowid), value) %>% setDT
# pull in the actual data and fill rest with 0's
dt_chosen_full <- merge(dt_chosen_full, dt_chosen, by=c("subject", "rowid", "value"),
                        all.x=TRUE)
dt_chosen_full[is.na(chosen_row), chosen_row := 0]
# use cumsum to identify now the cumulative count of these across the full row set
dt_chosen_full[, chosen := cumsum(chosen_row), by=.(subject, value)]
# as its prior, on the row itself, subtract one so the update happens after the row
dt_chosen_full[chosen_row==1, chosen := chosen-1]

# merge across twice
dt <- merge(dt, dt_chosen_full[, -"chosen_row"], 
            by.x=c("subject", "rowid", "stim1"), 
            by.y=c("subject", "rowid", "value"), 
            all.x=TRUE, sort=FALSE)
setnames(dt, "chosen", "stim1_chosen")
dt[is.na(stim1_chosen), stim1_chosen := 0]

dt <- merge(dt, dt_chosen_full[, -"chosen_row"], 
            by.x=c("subject", "rowid", "stim2"), 
            by.y=c("subject", "rowid", "value"), 
            all.x=TRUE, sort=FALSE)
setnames(dt, "chosen", "stim2_chosen")
dt[is.na(stim2_chosen), stim2_chosen := 0]

Output

dt[]
   subject rowid stim2 stim1 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
1:    1021     1    50    51     50          0          0            0            0
2:    1021     2    50    48     50          0          1            0            1
3:    1021     3    47    49     49          0          0            0            0
4:    1021     4    46    48     48          1          0            0            0
5:    1021     5    51    49     49          1          1            1            0
6:    1021     6    47    46     46          1          1            0            0

CodePudding user response：

Here's a single pipe, demonstrated on both frames.

dat1 is where you show some of the expected output

dat1[, c("stim1_seen", "stim2_seen") :=
         lapply(.SD, function(z) mapply(function(x, S) {
           sum(stim1[S] %in% x | stim2[S] %in% x)
         }, z, lapply(seq_len(.N)-1, seq_len))),
     .SDcols = c("stim1", "stim2"), by = .(subject)
     ][, c("stim1_chosen", "stim2_chosen") :=
           lapply(.SD, function(z) mapply(function(x, S) {
             sum(Chosen[S] %in% x)
           }, z, lapply(seq_len(.N)-1, seq_len))),
       .SDcols = c("stim1", "stim2"), by = .(subject)]
#    subject stim1 stim2 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
#      <int> <int> <int>  <int>      <int>      <int>        <int>        <int>
# 1:       1     2     1      2          0          0            0            0
# 2:       1     3     2      2          0          1            0            1
# 3:       1     3     1      1          1          1            0            0
# 4:       1     2     3      3          2          2            2            0
# 5:       1     1     3      1          2          3            1            1
# 6:       1     2     1      1          3          3            2            2

dat2 is your dput output (different data)

dat2[, c("stim1_seen", "stim2_seen") :=
         lapply(.SD, function(z) mapply(function(x, S) {
           sum(stim1[S] %in% x | stim2[S] %in% x)
         }, z, lapply(seq_len(.N)-1, seq_len))),
     .SDcols = c("stim1", "stim2"), by = .(subject)
     ][, c("stim1_chosen", "stim2_chosen") :=
           lapply(.SD, function(z) mapply(function(x, S) {
             sum(Chosen[S] %in% x)
           }, z, lapply(seq_len(.N)-1, seq_len))),
       .SDcols = c("stim1", "stim2"), by = .(subject)]
#    subject stim1 stim2 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
#      <num> <int> <int>  <int>      <int>      <int>        <int>        <int>
# 1:    1021    51    50     50          0          0            0            0
# 2:    1021    48    50     50          0          1            0            1
# 3:    1021    49    47     49          0          0            0            0
# 4:    1021    48    46     48          1          0            0            0
# 5:    1021    49    51     49          1          1            1            0
# 6:    1021    46    47     46          1          1            0            0

The heavy-effort here is trying to do a "cumulative %in%". In effect, that's what mapply is doing.

knowing that data.table's .N special symbol provides the number of rows in a group, then this is useful:
```
lapply(seq_len(.N)-1, seq_len)
# [[1]]
# integer(0)
# [[2]]
# [1] 1
# [[3]]
# [1] 1 2
# [[4]]
# [1] 1 2 3
# [[5]]
# [1] 1 2 3 4
# [[6]]
# [1] 1 2 3 4 5
```
This is used to index all rows before each row; that is, in row 1, there are no preceding rows, so we index on integer(0); in row 5, we index on 1 2 3 4; etc.
we "zip" that together (using mapply) along with each stim1 (and then stim2 value, to look for presence in the original stim1 and stim2 columns indexed on S (from the previous bullet), and sum the occurrences
finally, we do this for both of the stim* columns by iterating over .SD (using .SDcols)
this process is repeated (albeit more simply) on the Chosen column

Data

dat1 <- setDT(structure(list(subject = c(1L, 1L, 1L, 1L, 1L, 1L), stim1 = c(2L, 3L, 3L, 2L, 1L, 2L), stim2 = c(1L, 2L, 1L, 3L, 3L, 1L), Chosen = c(2L, 2L, 1L, 3L, 1L, 1L)), class = c("data.table", "data.frame"), row.names = c(NA, -6L)))
dat2 <- setDT(structure(list(subject = c(1021, 1021, 1021, 1021, 1021, 1021), stim1 = c(51L, 48L, 49L, 48L, 49L, 46L), stim2 = c(50L, 50L, 47L, 46L, 51L, 47L), Chosen = c(50L, 50L, 49L, 48L, 49L, 46L)), row.names = c(NA, -6L), class = c("data.table", "data.frame")))