library(data.table)
table1 <- data.table(id1 = c(1324, 2324, 29, 29, 1010, 1010),
type = c(1, 1, 2, 1, 1, 1),
class = c("A", "A", "B", "D", "D", "A"),
number = c(1, 98, 100, 100, 70, 70))
table2 <- data.table(id2 = c(1998, 1998, 2000, 2000, 2000, 2010, 2012, 2012),
type = c(1, 1, 3, 1, 1, 5, 1, 1),
class = c("D", "A", "D", "D", "A", "B", "A", "A"),
min_number = c(34, 0, 20, 45, 5, 23, 1, 1),
max_number = c(50, 100, 100, 100, 100, 9, 10, 100))
> table1
id1 type class number
1: 1324 1 A 1
2: 2324 1 A 98
3: 29 2 B 100
4: 29 1 D 100
5: 1010 1 D 70
6: 1010 1 A 70
> table2
id2 type class min_number max_number
1: 1998 1 D 34 50
2: 1998 1 A 0 100
3: 2000 3 D 20 100
4: 2000 1 D 45 100
5: 2000 1 A 5 100
6: 2010 5 B 23 9
7: 2012 1 A 1 10
8: 2012 1 A 1 100
I have two tables, and I would like to merge them based on type
, class
, and whether number
lies between min_number
and max_number
. Then I would like to create a new variable nMatch
that stores the number of unique id2
s that match with each id1
.
setindexv(table2, c("type", "class"))
for (t1_row in seq_len(nrow(table1))) {
print(t1_row)
set(
table1, t1_row, "matches",
table2[table1[t1_row], on = c("type", "class", "max_number >= number", "min_number <= number"), .(list(id2))]
)
}
> table1[, .(nMatch = uniqueN(unlist(matches), na.rm = TRUE)), by = .(id1)]
id1 nMatch
1: 1324 2
2: 2324 3
3: 29 1
4: 1010 3
The approach above is row-by-row as suggested here, but my real dataset has millions of rows. What's another way of doing this that's faster?
CodePudding user response:
You can try data.table
with on = .(...)
to merge two data tables
na.omit(
table1[
table2,
.(id1, id2),
on = .(type, class, number >= min_number, number <= max_number),
by = .EACHI
]
)[
,
.(nMatch = uniqueN(id2)),
id1
]
and will get
id1 nMatch
1: 1324 2
2: 1010 3
3: 2324 3
4: 29 1
CodePudding user response:
An option with tidyverse
library(dplyr)
library(tidyr)
left_join(table1, table2, by =
join_by(type, class, number >= min_number, number <= max_number)) %>%
distinct(id1, id2) %>%
drop_na %>%
count(id1, name = "nMatch")
-output
id1 nMatch
<num> <int>
1: 29 1
2: 1010 3
3: 1324 2
4: 2324 3