How to merge two tables and then tabulate counts in one step using data.table-CodePudding

library(data.table)
table1 <- data.table(id1 = c(1324, 2324, 29, 29, 1010, 1010),
                     type = c(1, 1, 2, 1, 1, 1),
                     class = c("A",  "A", "B", "D", "D", "A"),
                     number = c(1, 98, 100, 100, 70, 70))
table2 <- data.table(id2 = c(1998, 1998, 2000, 2000, 2000, 2010, 2012, 2012),
                     type = c(1, 1, 3, 1, 1, 5, 1, 1),
                     class = c("D", "A", "D", "D", "A", "B", "A", "A"),
                     min_number = c(34, 0, 20, 45, 5, 23, 1, 1),
                     max_number = c(50, 100, 100, 100, 100, 9, 10, 100))
> table1
    id1 type class number
1: 1324    1     A      1
2: 2324    1     A     98
3:   29    2     B    100
4:   29    1     D    100
5: 1010    1     D     70
6: 1010    1     A     70
> table2
    id2 type class min_number max_number
1: 1998    1     D         34         50
2: 1998    1     A          0        100
3: 2000    3     D         20        100
4: 2000    1     D         45        100
5: 2000    1     A          5        100
6: 2010    5     B         23          9
7: 2012    1     A          1         10
8: 2012    1     A          1        100

Step 1. I have two tables, and I would like to merge them based on type, class, and whether number lies between min_number and max_number.

merged <- table2[table1, on = c("type", "class", "max_number >= number", "min_number <= number")]
> merged
     id2 type class min_number max_number  id1
 1: 1998    1     A          1          1 1324
 2: 2012    1     A          1          1 1324
 3: 2012    1     A          1          1 1324
 4: 1998    1     A         98         98 2324
 5: 2000    1     A         98         98 2324
 6: 2012    1     A         98         98 2324
 7:   NA    2     B        100        100   29
 8: 2000    1     D        100        100   29
 9: 2000    1     D         70         70 1010
10: 1998    1     A         70         70 1010
11: 2000    1     A         70         70 1010
12: 2012    1     A         70         70 1010

Step 2. Then for each class, I would like to count how many unique id1s there are and how many unique id2s there are. The final desired output is this:

library(dplyr)
count_merged <- merged %>% group_by(class) %>% 
  summarise(n_id2 = n_distinct(id2[!is.na(id2)]),
            n_id1 = n_distinct(id1))
> count_merged
# A tibble: 3 × 3
  class n_id2 n_id1
  <chr> <int> <int>
1 A         3     3
2 B         0     1
3 D         1     2

My question: is there a faster way of doing this? If table1 and table2 have hundreds of thousands of rows, then it is extremely slow to do merge the two tables in Step 1. Is there a way to obtain the counts without merging?

CodePudding user response：

With data.table::uniqueN:

merged[,.(nid2=uniqueN(id2,na.rm=T),nid1=uniqueN(id1)),class]

#    class  nid2  nid1
#   <char> <int> <int>
#1:      A     3     3
#2:      B     0     1
#3:      D     1     2

Performance comparison:

Unit: microseconds
  expr    min      lq     mean  median      uq     max neval
 dplyr 5249.1 5955.15 8626.139 6381.50 8118.95 79625.4   100
    dt  675.3  876.75 1122.979  981.65 1297.50  2649.8   100