In the data.table
below, I want to subset only the groups (in
column x
), which have at-least 2 non-NA values.
dt <- data.table(x = c(rep("a", 5), rep("b", 3), rep("c", 3), rep("d", 3), rep("e", 4)), y = c(NA, 1, 2, 3, NA, NA, 2, 3, 1, 2, 3, NA, NA, 1, 1, 2, NA, NA))
x y
1: a NA
2: a 1
3: a 2
4: a 3
5: a NA
6: b NA
7: b 2
8: b 3
9: c 1
10: c 2
11: c 3
12: d NA
13: d NA
14: d 1
15: e 1
16: e 2
17: e NA
18: e NA
The correct solution is too slow on the large data sets.
dt[, .SD[sum(!is.na(y)) >= 2], by = "x"]
The alternate solution is fast, but I don't know how to put the condition to check for non-NA values.
dt[, if(.N >= 2L) .SD, by = "x"]
Here is the performance
microbenchmark::microbenchmark(
a = dt[, .SD[sum(!is.na(y)) >= 2], by = "x"],
b = dt[, if(.N >= 2L) .SD, by = "x"],
times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
a 1953.548 1977.380 2026.2570 2021.6775 2055.699 2276.247 100 b
b 862.626 889.073 936.8346 906.7315 918.019 3744.886 100 a
CodePudding user response:
A faster option would be to get the index with .I
and extract that row index
dt[dt[, .I[sum(!is.na(y)) >= 2], by = "x"]$V1]
-benchmarks
microbenchmark::microbenchmark(
a = dt[, .SD[sum(!is.na(y)) >= 2], by = "x"],
b = dt[dt[, .I[sum(!is.na(y)) >= 2], by = "x"]$V1],
times = 100)
Unit: microseconds
expr min lq mean median uq max neval
a 898.833 943.8020 1095.5831 1037.7235 1105.3080 4366.111 100
b 516.667 579.7975 622.3739 613.9905 661.9465 1010.461 100