data.table - keep first row per group OR based on condition

Time:10-08

I would like to keep the first observation in each group, or any row where mpg >= 10. Is there a way to do this without first creating a grouping variable from .N?

I am looking for a solution using the data.table package. I tried the code below, but by expects a j expression, so I get a warning.

library(data.table)

x <- mtcars

setDT(x)

x[.N==1 | mpg >= 10,,by=carb]

CodePudding user response:

Try this.

Using mpg >= 50 (which no car in mtcars reaches), we should get exactly one row per carb:

x[ rowid(carb) == 1 | mpg >= 50,]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6 160.0   110  3.90  2.62 16.46     0     1     4     4
# 2:  22.8     4 108.0    93  3.85  2.32 18.61     1     1     4     1
# 3:  18.7     8 360.0   175  3.15  3.44 17.02     0     0     3     2
# 4:  16.4     8 275.8   180  3.07  4.07 17.40     0     0     3     3
# 5:  19.7     6 145.0   175  3.62  2.77 15.50     0     1     5     6
# 6:  15.0     8 301.0   335  3.54  3.57 14.60     0     1     5     8

Every car in mtcars has mpg > 10, so the original threshold would keep every row. Using mpg >= 30 instead, we should get all of the above plus a few more:

x[ rowid(carb) == 1 | mpg >= 30,]
#       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#  1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
#  2:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
#  3:  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
#  4:  16.4     8 275.8   180  3.07 4.070 17.40     0     0     3     3
#  5:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
#  6:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
#  7:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
#  8:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
#  9:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6
# 10:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8

An alternative, in case you need more grouping variables:

x[, .SD[seq_len(.N) == 1L | mpg >= 30,], by = carb]

though I've been informed that rowid(...) is more efficient than seq_len(.N).
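As a rough sketch of the multi-variable case, rowid() also accepts several vectors, so extra grouping columns can be passed to it directly. The choice of gear as the second grouping column, and the names by_rowid and by_sd, are only for illustration:

```r
library(data.table)

# Work on a copy so the built-in mtcars data frame is left untouched.
x <- as.data.table(mtcars)

# rowid() numbers rows within each combination of the vectors it is given,
# so a second grouping column (gear, chosen for illustration) can be added.
by_rowid <- x[rowid(carb, gear) == 1L | mpg >= 30]

# The same filter written with .SD and by, for comparison.
by_sd <- x[, .SD[seq_len(.N) == 1L | mpg >= 30], by = .(carb, gear)]

# Align the column order, then check that both results hold the same rows.
# by_sd is ordered by group, by_rowid keeps the original row order.
setcolorder(by_sd, names(by_rowid))
fsetequal(by_rowid, by_sd)
```

The rowid() version keeps the rows in their original order, while the .SD version returns them grouped by carb and gear, with the grouping columns moved to the front.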

CodePudding user response:

We can use .I to get the row indices for subsetting:

 x[x[, .I[seq_len(.N) == 1|mpg >= 30], by = carb]$V1]
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
 2: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
 3: 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
 4: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
 5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
 6: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
 7: 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
 8: 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
 9: 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
10: 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
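To make the two steps in the .I approach easier to follow, here is a small sketch that splits the one-liner apart. The intermediate names idx and res are introduced only for illustration:

```r
library(data.table)

# Work on a copy so the built-in mtcars data frame is left untouched.
x <- as.data.table(mtcars)

# Step 1: within each carb group, .I holds the row numbers of that group in
# the full table. Keep those that are first in the group or have mpg >= 30.
# The unnamed result column is called V1 by default.
idx <- x[, .I[seq_len(.N) == 1L | mpg >= 30], by = carb]$V1

# Step 2: subset the full table once with the collected row numbers.
# The rows come back grouped by carb, in order of first appearance.
res <- x[idx]

# Sorting the indices first gives the rows in their original order instead.
res_sorted <- x[sort(idx)]
```

Collecting row numbers and subsetting once avoids building a .SD subset for every group, which is the usual reason for preferring .I on larger tables.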