Home > database >  Same j as by in datatable leads to strange outcome
Same j as by in datatable leads to strange outcome

Time:07-02

Let's keep this simple. I have the following DT:

library(data.table)
test <- setDT(data.frame(a=c(1,2,2,1,3,3,3,4,4,4),b=c(1:9,9)))

test[a == 4,sum(b),b]

gives:

   b V1
1: 8  8
2: 9  9

but why? Should the sum of the second row not be equal to 18? If I use first:

test[a == 4]

we get

   a b
1: 4 8
2: 4 9
3: 4 9

so clearly, the sum of b by the second b group should be 18. What am I missing? Note: I am not saying that command makes much sense, but I am trying to understand it at this time.

CodePudding user response:

You shouldn't sum and group with the same column b, otherwise you will always get a new column V1, whose value is same with the group column b, because each value in group column b is unique, which means you will get it self if you sum by an unique value in b (V1 = b) . For your purpose, you can group with a new variable rleid(b), which marks different groups.

dt = data.table(a=c(1,2,2,1,3,3,3,4,4,4),b=c(1:9,9)))
dt[a == 4, .(a = unique(a), b = unique(b), sumb = sum(b)), by = rleid(b)]
#   rleid     a     b  sumb
#   <int> <num> <num> <num>
#1:     1     4     8     8
#2:     2     4     9    18

P.S. You can also have a better understanding if you create a copy of column b with dt[,c:=b] and then try dt[a == 4,sum(b),c]

dt[,c := b][a == 4,sum(b),c]
#       c    V1
#   <num> <num>
#1:     8     8
#2:     9    18
  • Related