In running the code at the bottom, I add a "total" column to data frame testDF
. I need "ID" for instances where "total" > 0. So the output I'm looking for in this example is simply 1 and 50, those ID's where "total" > 0. How would I efficiently compute this using data.table
? Noting that the actual database this will be run against has millions of rows so I'm hoping to avoid unnecessary calculations.
I include seemingly extraneous columns "Period_1", "Period_2", and "State", because when I was fooling around with data.table
subsetting, in running things like lapply(.SD, sum), by=.(ID)][, if(sum(PUR) > 0) .SD, by=ID]
, I was getting errors like
"Error in sum(Period_2) : invalid 'type' (character) of argument"
I'll use these outputs for a "join", which is something I can do in data.table
(I think).
Here's a view of the output when running the code:
Code:
library(data.table)
testDF <-
data.frame(
ID = as.numeric(c(rep(1,3),rep(50,3),rep(60,3))),
Period_1 = as.numeric(c(1:3,1:3,1:3)),
Period_2 = c("2012-06","2012-07","2012-08","2013-06","2013-07","2013-08","2012-01","2012-02","2012-03"),
PUR = as.numeric(c(rep(10,3),21:23,rep(0,3))),
CA = as.numeric(c(rep(5,3),11:13,rep(0,3))),
State = c("XX","AA","XX","AA","BB","CC","SS","XX","AA")
)
testDF_Adv <- testDF
setDT(testDF_Adv)[, total := sum(PUR CA), by=list(ID)]
testDF_Adv <- as.data.frame(testDF_Adv)
testDF_Adv
CodePudding user response:
You can simply do this:
setDT(testDF)[, if(sum(PUR CA)>0) ID,ID][,ID]
Output:
[1] 1 50
CodePudding user response:
Below, I just create a new data table by filtering out the rows with 0 totals. Are you looking to do this in one step so it will be more efficient?
library(data.table)
testDF <-
data.frame(
ID = as.numeric(c(rep(1,3),rep(50,3),rep(60,3))),
Period_1 = as.numeric(c(1:3,1:3,1:3)),
Period_2 = c("2012-06","2012-07","2012-08","2013-06","2013-07","2013-08","2012-01","2012-02","2012-03"),
PUR = as.numeric(c(rep(10,3),21:23,rep(0,3))),
CA = as.numeric(c(rep(5,3),11:13,rep(0,3))),
State = c("XX","AA","XX","AA","BB","CC","SS","XX","AA")
)
testDF_Adv <- testDF
setDT(testDF_Adv)[, total:=sum(PUR CA),by=list(ID)]
testDF2 = testDF_Adv[total>0,]
testDF2
CodePudding user response:
You can do this in one step adding additional subset in brackets.
setDT(testDF_Adv)[, total := sum(PUR CA), by=list(ID)][total > 0]
# ID Period_1 Period_2 PUR CA State total
# 1: 1 1 2012-06 10 5 XX 45
# 2: 1 2 2012-07 10 5 AA 45
# 3: 1 3 2012-08 10 5 XX 45
# 4: 50 1 2013-06 21 11 AA 102
# 5: 50 2 2013-07 22 12 BB 102
# 6: 50 3 2013-08 23 13 CC 102