Say I have a large table with the following columns:
   subject stim1 stim2 Chosen
1:       1     2     1      2
2:       1     3     2      2
3:       1     3     1      1
4:       1     2     3      3
5:       1     1     3      1
I'm looking for an efficient way (since the full data set is large) to mutate, by subject, two additional pairs of columns:
- stim1_seen, stim2_seen: the count of prior rows in which the current stim1 value (for stim1_seen) or the current stim2 value (for stim2_seen) appeared in either stim1 or stim2.
- stim1_chosen, stim2_chosen: the count of prior rows in which the current stim1 value and the current stim2 value, respectively, were chosen.
Desired output:
   subject stim1 stim2 Chosen stim1_chosen stim2_chosen
1:       1     2     1      2            0            0
2:       1     3     2      2            0            1
3:       1     3     1      1            0            0
4:       1     2     3      3            2            0
5:       1     1     3      1            1            1
6:       1     2     1      1            2            2
Ideally it'd use data.table or dplyr. Here is the dput:
structure(list(subject = c(1021, 1021, 1021, 1021, 1021, 1021),
    stim1 = c(51L, 48L, 49L, 48L, 49L, 46L),
    stim2 = c(50L, 50L, 47L, 46L, 51L, 47L),
    Chosen = c(50L, 50L, 49L, 48L, 49L, 46L)),
    row.names = c(NA, -6L), class = c("data.table", "data.frame"))
CodePudding user response:
OK, this works on the sample data. It would be good to run it on data with more subjects and with values greater than 1 in the columns. I've assumed it's a data.table object called dt.
1. Index
Merge operations can easily change the row ordering, so never rely on row numbers; instead create a rowid by subject. .N is data.table syntax for the number of rows in the current group.
# order matters, so make a rowid
dt[, rowid := 1:.N, by=subject]
# sets orders and indexing to make it quicker
setkey(dt, subject, rowid)
2. Seen cols
We need to combine stim1 and stim2 into one column; do so by going from a wide to a long format with melt. Then seen := 0:(.N-1), grouped by subject and value, gives the running count of rows in which each value has appeared; it starts at 0 rather than 1 because only prior rows should count. Finally we merge back twice, as we're comparing against both stim columns.
# for seen, melt wide to long
dt_seen <- melt(dt,
                id.vars = c("subject", "rowid"),
                measure.vars = c("stim1", "stim2"))
# interested in finding occurrences; unique() ensures a row where
# stim1 == stim2 is only counted once
dt_seen <- unique(dt_seen[, .(subject, rowid, value)])
setorder(dt_seen, rowid)
dt_seen[, seen := 0:(.N-1), by = .(subject, value)]
# merge across twice
dt <- merge(dt, dt_seen,
            by.x = c("subject", "rowid", "stim1"),
            by.y = c("subject", "rowid", "value"),
            all.x = TRUE, sort = FALSE)
setnames(dt, "seen", "stim1_seen")
dt <- merge(dt, dt_seen,
            by.x = c("subject", "rowid", "stim2"),
            by.y = c("subject", "rowid", "value"),
            all.x = TRUE, sort = FALSE)
setnames(dt, "seen", "stim2_seen")
dt[]
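As a brute-force cross-check of the seen logic, here is a sketch that counts prior rows directly. It runs on a fresh copy chk of the dput data (so it leaves dt alone) and is quadratic per subject, so it's only for validating the merge-based approach on a small sample:

```r
library(data.table)

# the dput data from the question
chk <- data.table(
  subject = c(1021, 1021, 1021, 1021, 1021, 1021),
  stim1   = c(51L, 48L, 49L, 48L, 49L, 46L),
  stim2   = c(50L, 50L, 47L, 46L, 51L, 47L),
  Chosen  = c(50L, 50L, 49L, 48L, 49L, 46L)
)

# for each row, count prior rows (within subject) where the current
# stim1 value appeared in either stimulus column
chk[, stim1_seen := vapply(seq_len(.N), function(i) {
  prior <- seq_len(i - 1)
  sum(stim1[prior] == stim1[i] | stim2[prior] == stim1[i])
}, integer(1)), by = subject]

chk$stim1_seen
# [1] 0 0 0 1 1 1
```

This matches the stim1_seen column produced by the melt-and-merge approach above.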
3. Chosen
This does effectively the same as section (2), but starting from the Chosen column alone, and handling stim1 and stim2 one at a time since these columns are independent. The process is identical for both, so it could be tidied up slightly.
# turn Chosen from wide to long
dt_chosen <- melt(dt,
                  id.vars = c("subject", "rowid"),
                  measure.vars = c("Chosen"))
# interested in finding occurrences
dt_chosen[, variable := NULL]
# going to expand the grid, so we can look at e.g. value 50 for all rowids
library(tidyr)  # for expand() and nesting()
dt_chosen[, chosen_row := 1]
dt_chosen_full <- expand(dt_chosen, nesting(subject, rowid), value) %>% setDT
# pull in the actual data and fill the rest with 0's
dt_chosen_full <- merge(dt_chosen_full, dt_chosen,
                        by = c("subject", "rowid", "value"),
                        all.x = TRUE)
dt_chosen_full[is.na(chosen_row), chosen_row := 0]
# use cumsum to get the cumulative count of these across the full row set
dt_chosen_full[, chosen := cumsum(chosen_row), by = .(subject, value)]
# since we want prior counts only, subtract one on the row itself so the
# update takes effect after that row
dt_chosen_full[chosen_row == 1, chosen := chosen - 1]
# merge across twice
dt <- merge(dt, dt_chosen_full[, -"chosen_row"],
            by.x = c("subject", "rowid", "stim1"),
            by.y = c("subject", "rowid", "value"),
            all.x = TRUE, sort = FALSE)
setnames(dt, "chosen", "stim1_chosen")
dt[is.na(stim1_chosen), stim1_chosen := 0]
dt <- merge(dt, dt_chosen_full[, -"chosen_row"],
            by.x = c("subject", "rowid", "stim2"),
            by.y = c("subject", "rowid", "value"),
            all.x = TRUE, sort = FALSE)
setnames(dt, "chosen", "stim2_chosen")
dt[is.na(stim2_chosen), stim2_chosen := 0]
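The chosen columns can be sanity-checked the same way with a direct quadratic count (again a sketch on a fresh copy chk rather than dt); note that Chosen stays visible inside j even though .SDcols restricts .SD to the stim columns:

```r
library(data.table)

# the dput data from the question
chk <- data.table(
  subject = c(1021, 1021, 1021, 1021, 1021, 1021),
  stim1   = c(51L, 48L, 49L, 48L, 49L, 46L),
  stim2   = c(50L, 50L, 47L, 46L, 51L, 47L),
  Chosen  = c(50L, 50L, 49L, 48L, 49L, 46L)
)

# for each stim column, count how many prior Chosen values (within
# subject) match the current stimulus value
chk[, c("stim1_chosen", "stim2_chosen") := lapply(.SD, function(v)
  vapply(seq_len(.N), function(i) sum(Chosen[seq_len(i - 1)] == v[i]),
         integer(1))),
  .SDcols = c("stim1", "stim2"), by = subject]

chk$stim1_chosen
# [1] 0 0 0 0 1 0
chk$stim2_chosen
# [1] 0 1 0 0 0 0
```

These agree with the stim1_chosen and stim2_chosen values from the expand-and-merge approach.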
Output
dt[]
   subject rowid stim2 stim1 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
1:    1021     1    50    51     50          0          0            0            0
2:    1021     2    50    48     50          0          1            0            1
3:    1021     3    47    49     49          0          0            0            0
4:    1021     4    46    48     48          1          0            0            0
5:    1021     5    51    49     49          1          1            1            0
6:    1021     6    47    46     46          1          1            0            0
CodePudding user response:
Here's a single pipe, demonstrated on both frames. dat1 is the sample data from which you show some of the expected output:
dat1[, c("stim1_seen", "stim2_seen") :=
       lapply(.SD, function(z) mapply(function(x, S) {
         sum(stim1[S] %in% x | stim2[S] %in% x)
       }, z, lapply(seq_len(.N) - 1, seq_len))),
     .SDcols = c("stim1", "stim2"), by = .(subject)
   ][, c("stim1_chosen", "stim2_chosen") :=
       lapply(.SD, function(z) mapply(function(x, S) {
         sum(Chosen[S] %in% x)
       }, z, lapply(seq_len(.N) - 1, seq_len))),
     .SDcols = c("stim1", "stim2"), by = .(subject)]
#    subject stim1 stim2 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
#      <int> <int> <int>  <int>      <int>      <int>        <int>        <int>
# 1:       1     2     1      2          0          0            0            0
# 2:       1     3     2      2          0          1            0            1
# 3:       1     3     1      1          1          1            0            0
# 4:       1     2     3      3          2          2            2            0
# 5:       1     1     3      1          2          3            1            1
# 6:       1     2     1      1          3          3            2            2
dat2 is your dput output (different data):
dat2[, c("stim1_seen", "stim2_seen") :=
       lapply(.SD, function(z) mapply(function(x, S) {
         sum(stim1[S] %in% x | stim2[S] %in% x)
       }, z, lapply(seq_len(.N) - 1, seq_len))),
     .SDcols = c("stim1", "stim2"), by = .(subject)
   ][, c("stim1_chosen", "stim2_chosen") :=
       lapply(.SD, function(z) mapply(function(x, S) {
         sum(Chosen[S] %in% x)
       }, z, lapply(seq_len(.N) - 1, seq_len))),
     .SDcols = c("stim1", "stim2"), by = .(subject)]
#    subject stim1 stim2 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
#      <num> <int> <int>  <int>      <int>      <int>        <int>        <int>
# 1:    1021    51    50     50          0          0            0            0
# 2:    1021    48    50     50          0          1            0            1
# 3:    1021    49    47     49          0          0            0            0
# 4:    1021    48    46     48          1          0            0            0
# 5:    1021    49    51     49          1          1            1            0
# 6:    1021    46    47     46          1          1            0            0
The heavy effort here is trying to do a "cumulative %in%". In effect, that's what mapply is doing.
- Knowing that data.table's .N special symbol provides the number of rows in a group, then this is useful:
  lapply(seq_len(.N) - 1, seq_len)
  # [[1]]
  # integer(0)
  # [[2]]
  # [1] 1
  # [[3]]
  # [1] 1 2
  # [[4]]
  # [1] 1 2 3
  # [[5]]
  # [1] 1 2 3 4
  # [[6]]
  # [1] 1 2 3 4 5
  This is used to index all rows before each row; that is, in row 1 there are no preceding rows, so we index on integer(0); in row 5, we index on 1 2 3 4; etc.
- We "zip" that together (using mapply) with each stim1 (and then stim2) value, to look for presence in the original stim1 and stim2 columns indexed on S (from the previous bullet), and sum the occurrences.
- Finally, we do this for both of the stim* columns by iterating over .SD (using .SDcols).
- This process is repeated (albeit more simply) on the Chosen column.
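The "cumulative %in%" idea can also be isolated as a small base-R helper (prior_seen is a made-up name for illustration):

```r
# for each element, count how many earlier elements equal it
prior_seen <- function(x) {
  vapply(seq_along(x), function(i) sum(x[seq_len(i - 1)] %in% x[i]),
         integer(1))
}

prior_seen(c(2, 2, 1, 2))
# [1] 0 1 0 2
```

This is the scalar version of what the mapply call above does per group, with the seq_len list supplying the "earlier elements" index for each row.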
Data
dat1 <- setDT(structure(list(subject = c(1L, 1L, 1L, 1L, 1L, 1L), stim1 = c(2L, 3L, 3L, 2L, 1L, 2L), stim2 = c(1L, 2L, 1L, 3L, 3L, 1L), Chosen = c(2L, 2L, 1L, 3L, 1L, 1L)), class = c("data.table", "data.frame"), row.names = c(NA, -6L)))
dat2 <- setDT(structure(list(subject = c(1021, 1021, 1021, 1021, 1021, 1021), stim1 = c(51L, 48L, 49L, 48L, 49L, 46L), stim2 = c(50L, 50L, 47L, 46L, 51L, 47L), Chosen = c(50L, 50L, 49L, 48L, 49L, 46L)), row.names = c(NA, -6L), class = c("data.table", "data.frame")))