HAVE = data.frame(INSTRUCTOR = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
STUDENT = c(1, 2, 2, 2, 1, 3, 1, 1, 1, 1, 2, 1),
SCORE = c(10, 1, 0, 0, 7, 3, 5, 2, 2, 4, 10, 2),
TIME = c(1,1,2,3,2,1,1,2,3,1,1,2))
WANT = data.frame(INSTRUCTOR = c(1, 2, 3),
SCORE.DIF = c(-9, NA, 6))
For each INSTRUCTOR, I wish to find the SCORE of the first and second STUDENT, and subtract their scores. The STUDENT code varies so I wish not to use '==1' vs '==2'
I try:
HAVE[, .SD[1:2], by = 'INSTRUCTOR']
but do not know how to subtract vertically and obtain 'WANT' data frame from 'HAVE'
CodePudding user response:
library(data.table)
setDT(HAVE)
unique(HAVE, by = c("INSTRUCTOR", "STUDENT")
)[, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
# INSTRUCTOR SCORE.DIF
# <num> <num>
# 1: 1 -9
# 2: 2 NA
# 3: 3 6
To use your new TIME
variable, we can do
HAVE[, .SD[which.min(TIME),], by = .(INSTRUCTOR, STUDENT)
][, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
# INSTRUCTOR SCORE.DIF
# <num> <num>
# 1: 1 -9
# 2: 2 NA
# 3: 3 6
One might be tempted to replace SCORE[1:2]
with head(SCORE,2)
, but that won't work: head(SCORE,2)
will return length-1 if the input is length-2, as it is with instructor 2 (who only has one student albeit multiple times). When you run diff
on length-1 (e.g., diff(1)
), it returns a 0-length vector, which in the above data.table
code reduces to zero rows for instructor 2. However, when there is only one student, SCORE[1:2]
resolves to c(SCORE[1], NA)
, for which the diff is length-1 (as needed) and NA
(as needed).