library(data.table)
table1 <- data.table(id1 = c(1324, 2324, 29, 29, 1010, 1010),
type = c(1, 1, 2, 1, 1, 1),
class = c("A", "A", "B", "D", "D", "A"),
number = c(1, 98, 100, 100, 70, 70))
table2 <- data.table(id2 = c(1998, 1998, 2000, 2000, 2000, 2010, 2012, 2012),
type = c(1, 1, 3, 1, 1, 5, 1, 1),
class = c("D", "A", "D", "D", "A", "B", "A", "A"),
min_number = c(34, 0, 20, 45, 5, 23, 1, 1),
max_number = c(50, 100, 100, 100, 100, 9, 10, 100))
> table1
id1 type class number
1: 1324 1 A 1
2: 2324 1 A 98
3: 29 2 B 100
4: 29 1 D 100
5: 1010 1 D 70
6: 1010 1 A 70
> table2
id2 type class min_number max_number
1: 1998 1 D 34 50
2: 1998 1 A 0 100
3: 2000 3 D 20 100
4: 2000 1 D 45 100
5: 2000 1 A 5 100
6: 2010 5 B 23 9
7: 2012 1 A 1 10
8: 2012 1 A 1 100
Step 1. I have two tables, and I would like to merge them based on type
, class
, and whether number
lies between min_number
and max_number
.
merged <- table2[table1, on = c("type", "class", "max_number >= number", "min_number <= number")]
> merged
id2 type class min_number max_number id1
1: 1998 1 A 1 1 1324
2: 2012 1 A 1 1 1324
3: 2012 1 A 1 1 1324
4: 1998 1 A 98 98 2324
5: 2000 1 A 98 98 2324
6: 2012 1 A 98 98 2324
7: NA 2 B 100 100 29
8: 2000 1 D 100 100 29
9: 2000 1 D 70 70 1010
10: 1998 1 A 70 70 1010
11: 2000 1 A 70 70 1010
12: 2012 1 A 70 70 1010
Step 2. Then for each class
, I would like to count how many unique id1
s there are and how many unique id2
s there are. The final desired output is this:
library(dplyr)
count_merged <- merged %>% group_by(class) %>%
summarise(n_id2 = n_distinct(id2[!is.na(id2)]),
n_id1 = n_distinct(id1))
> count_merged
# A tibble: 3 × 3
class n_id2 n_id1
<chr> <int> <int>
1 A 3 3
2 B 0 1
3 D 1 2
My question: is there a faster way of doing this? If table1
and table2
have hundreds of thousands of rows, then it is extremely slow to do merge the two tables in Step 1. Is there a way to obtain the counts without merging?
CodePudding user response:
With data.table::uniqueN
:
merged[,.(nid2=uniqueN(id2,na.rm=T),nid1=uniqueN(id1)),class]
# class nid2 nid1
# <char> <int> <int>
#1: A 3 3
#2: B 0 1
#3: D 1 2
Performance comparison:
Unit: microseconds
expr min lq mean median uq max neval
dplyr 5249.1 5955.15 8626.139 6381.50 8118.95 79625.4 100
dt 675.3 876.75 1122.979 981.65 1297.50 2649.8 100