I'm trying to add a cumsum()
column to a dataframe using the data.table package per the below code and it doesn't work in this case (see section marked # this doesn't work, after adding cumsum at the bottom:
). I've used cumsum()
before but as standalone function in setDT()
, but now I'm trying to chain it to the code and it doesn't work. What am I doing wrong?
Here are the results I get for the code that does work (marked # this works, before cumsum:
in the code below), except I added a column to the right "cumX" which is what I tried adding using chaining, whereby "cumX" runs a cumulative sum of column "1stStateX":
Period 1stStateX cumX
1: 1 0 0
2: 2 1 1
3: 3 1 2
4: 4 0 2
5: 5 0 2
In contrast, when I run the code marked # this doesn't work...
, I get nothing back in the R studio console as illustrated here:
Also, when I run the code below marked # this works, before cumsum:
for the simple sum()
function (which works as intended), I noticed that the original dataframe testDF
is altered as illustrated below. Why is that and how do I prevent that from happening?
Code:
library(data.table)
testDF <-
data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
# this works, before cumsum:
setDT(testDF)[
, `1stStateX` := .I == .I[State == 'X'][1],ID][
, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period]
# this doesn't work, after adding cumsum at the bottom:
setDT(testDF)[
, `1stStateX` := .I == .I[State == 'X'][1],ID][
, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period][
, cumX := cumsum(`1stStateX`),by = Period]
Final solution using ismirsehregal´s answer, and after studying the recommended vignette I got chaining to work:
library(data.table)
testDF <- data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
testDT <- testDF
testDT <- setDT(testDT)[, `1stStateX` := .I == .I[State == 'X'][1],ID][
, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period][
, cumX := cumsum(`1stStateX`)]
testDT <- as.data.frame(testDT)
print(testDT)
CodePudding user response:
I don't know your expected output, however, I guess your confusion is the result of chaining code that modifies an existing data.table
object (Please see ?`:=`
) and code which results in a new data.table
object (Please see section d) Select column(s) in j
here).
Please check the following:
library(data.table)
testDF <- data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
DT <- copy(testDF)
setDT(DT)
DT[, `1stStateX` := .I == .I[State == 'X'][1], ID] # this step changes DT by reference, please see ?`:=`
aggregatedDT <- DT[, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period] # this does not change DT and results in another data.table
aggregatedDT[, cumX := cumsum(`1stStateX`)]
print(aggregatedDT)
Result:
Period 1stStateX cumX
1: 1 0 0
2: 2 1 1
3: 3 1 2
4: 4 0 2
5: 5 0 2