Home > Software engineering >  Why does cumsum function not work in data.table package when attempting chaining?
Why does cumsum function not work in data.table package when attempting chaining?

Time:11-26

I'm trying to add a cumsum() column to a dataframe using the data.table package per the below code and it doesn't work in this case (see section marked # this doesn't work, after adding cumsum at the bottom:). I've used cumsum() before but as standalone function in setDT(), but now I'm trying to chain it to the code and it doesn't work. What am I doing wrong?

Here are the results I get for the code that does work (marked # this works, before cumsum: in the code below), except I added a column to the right "cumX" which is what I tried adding using chaining, whereby "cumX" runs a cumulative sum of column "1stStateX":

   Period 1stStateX   cumX
1:      1         0      0
2:      2         1      1
3:      3         1      2
4:      4         0      2
5:      5         0      2

In contrast, when I run the code marked # this doesn't work..., I get nothing back in the R studio console as illustrated here:

enter image description here

Also, when I run the code below marked # this works, before cumsum: for the simple sum() function (which works as intended), I noticed that the original dataframe testDF is altered as illustrated below. Why is that and how do I prevent that from happening?

enter image description here

Code:

library(data.table)

testDF <-
  data.frame(
    ID = c(rep(10,5),rep(50,5),rep(60,5)),
    Period = c(1:5,1:5,1:5),
    State = c("A","B","X","X","X",
              "A","A","A","A","A",
              "A","X","A","X","B")
  )

# this works, before cumsum:
setDT(testDF)[
  , `1stStateX` := .I == .I[State == 'X'][1],ID][
    , .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period]


# this doesn't work, after adding cumsum at the bottom:
setDT(testDF)[
  , `1stStateX` := .I == .I[State == 'X'][1],ID][
    , .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period][
      , cumX := cumsum(`1stStateX`),by = Period]

Final solution using ismirsehregal´s answer, and after studying the recommended vignette I got chaining to work:

library(data.table)

testDF <- data.frame(
  ID = c(rep(10,5),rep(50,5),rep(60,5)),
  Period = c(1:5,1:5,1:5),
  State = c("A","B","X","X","X",
            "A","A","A","A","A",
            "A","X","A","X","B")
)

testDT <- testDF
testDT <- setDT(testDT)[, `1stStateX` := .I == .I[State == 'X'][1],ID][
    , .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period][
      , cumX := cumsum(`1stStateX`)]
testDT <- as.data.frame(testDT)
print(testDT)

CodePudding user response:

I don't know your expected output, however, I guess your confusion is the result of chaining code that modifies an existing data.table object (Please see ?`:=`) and code which results in a new data.table object (Please see section d) Select column(s) in j here).

Please check the following:

library(data.table)

testDF <- data.frame(
    ID = c(rep(10,5),rep(50,5),rep(60,5)),
    Period = c(1:5,1:5,1:5),
    State = c("A","B","X","X","X",
              "A","A","A","A","A",
              "A","X","A","X","B")
  )

DT <- copy(testDF)
setDT(DT)

DT[, `1stStateX` := .I == .I[State == 'X'][1], ID] # this step changes DT by reference, please see ?`:=`
aggregatedDT <- DT[, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period] # this does not change DT and results in another data.table

aggregatedDT[, cumX := cumsum(`1stStateX`)]
print(aggregatedDT)

Result:

   Period 1stStateX cumX
1:      1         0    0
2:      2         1    1
3:      3         1    2
4:      4         0    2
5:      5         0    2
  • Related