Home > other >  How to filter and subset data by group using data.table?
How to filter and subset data by group using data.table?

Time:11-21

In running the code at the bottom, I add a "total" column to data frame testDF. I need "ID" for instances where "total" > 0. So the output I'm looking for in this example is simply 1 and 50, those ID's where "total" > 0. How would I efficiently compute this using data.table? Noting that the actual database this will be run against has millions of rows so I'm hoping to avoid unnecessary calculations.

I include seemingly extraneous columns "Period_1", "Period_2", and "State", because when I was fooling around with data.table subsetting, in running things like lapply(.SD, sum), by=.(ID)][, if(sum(PUR) > 0) .SD, by=ID], I was getting errors like "Error in sum(Period_2) : invalid 'type' (character) of argument"

I'll use these outputs for a "join", which is something I can do in data.table (I think).

Here's a view of the output when running the code:

enter image description here

Code:

library(data.table)

testDF <-
  data.frame(
    ID = as.numeric(c(rep(1,3),rep(50,3),rep(60,3))),
    Period_1 = as.numeric(c(1:3,1:3,1:3)),
    Period_2 = c("2012-06","2012-07","2012-08","2013-06","2013-07","2013-08","2012-01","2012-02","2012-03"),
    PUR = as.numeric(c(rep(10,3),21:23,rep(0,3))),
    CA = as.numeric(c(rep(5,3),11:13,rep(0,3))),
    State = c("XX","AA","XX","AA","BB","CC","SS","XX","AA")
  )

testDF_Adv <- testDF
setDT(testDF_Adv)[, total := sum(PUR   CA), by=list(ID)]
testDF_Adv <- as.data.frame(testDF_Adv)
testDF_Adv

CodePudding user response:

You can simply do this:

setDT(testDF)[, if(sum(PUR CA)>0) ID,ID][,ID]

Output:

[1]  1 50

CodePudding user response:

Below, I just create a new data table by filtering out the rows with 0 totals. Are you looking to do this in one step so it will be more efficient?

library(data.table)

testDF <-
  data.frame(
    ID = as.numeric(c(rep(1,3),rep(50,3),rep(60,3))),
    Period_1 = as.numeric(c(1:3,1:3,1:3)),
    Period_2 = c("2012-06","2012-07","2012-08","2013-06","2013-07","2013-08","2012-01","2012-02","2012-03"),
    PUR = as.numeric(c(rep(10,3),21:23,rep(0,3))),
    CA = as.numeric(c(rep(5,3),11:13,rep(0,3))),
    State = c("XX","AA","XX","AA","BB","CC","SS","XX","AA")
  )

testDF_Adv <- testDF
setDT(testDF_Adv)[, total:=sum(PUR CA),by=list(ID)]
testDF2 = testDF_Adv[total>0,]
testDF2

CodePudding user response:

You can do this in one step adding additional subset in brackets.

setDT(testDF_Adv)[, total := sum(PUR   CA), by=list(ID)][total > 0]
#    ID Period_1 Period_2 PUR CA State total
# 1:  1        1  2012-06  10  5    XX    45
# 2:  1        2  2012-07  10  5    AA    45
# 3:  1        3  2012-08  10  5    XX    45
# 4: 50        1  2013-06  21 11    AA   102
# 5: 50        2  2013-07  22 12    BB   102
# 6: 50        3  2013-08  23 13    CC   102
  • Related