table1 <- data.frame(id1 = c(1324, 1111, 2324, 2324, 1111, 1111),
class = c(1, 1, 1, 2, 2, 1),
type = c("A", "A", "A", "B", "C", "D"),
number = c(1, 2.5, 98, 100, 80, 50))
> table1
id1 class type number
1 1324 1 A 1.0
2 1111 1 A 2.5
3 2324 1 A 98.0
4 2324 2 B 100.0
5 1111 2 C 80.0
6 1111 1 D 50.0
table2 <- data.frame(id2 = c(1992, 1987, 1998, 1998, 2000, 2010, 2012),
class = c(3, 3, 1, 1, 3, 1, 2),
type = c("B", "C", "D", "A", "D", "D", "C"),
min_number = c(0, 0, 34, 0, 20, 45, 5),
max_number = c(18, 18, 50, 100, 100, 100, 100))
> table2
id2 class type min_number max_number
1 1992 3 B 0 18
2 1987 3 C 0 18
3 1998 1 D 34 50
4 1998 1 A 0 100
5 2000 3 D 20 100
6 2010 1 D 45 100
7 2012 2 C 5 100
I have 2 tables. For each row in table1
, I want to find the rows that match it in table2
and store the corresponding id2s
. That is, I want to match on class
, type
, and whether or not min_number <= number <= max_number
.
I wrote the for loop
below to do this:
id2_list <- vector("list")
for(i in 1:nrow(table1)){
id2_list[[i]] <- table2[which((table2$class == table1$class[i])
& (table2$type %in% table1$type[i])
& ((table2$min_number <= table1$number[i])
& (table2$max_number >= table1$number[i]))), ]$id2
}
> id2_list
[[1]]
[1] 1998
[[2]]
[1] 1998
[[3]]
[1] 1998
[[4]]
numeric(0)
[[5]]
[1] 2012
[[6]]
[1] 1998 2010
This list contains the id2
s that match each of the 6 id1
s in table1
. It's possible for a row in table1
to match 0, 1, or more rows in table2
.
Is there a faster way to do this without using a for loop
? If I have millions of rows in either table. I think this would be very inefficient.
CodePudding user response:
Yes, you can do this with a pmap
and filter
inside a mutate as such:
library(tidyverse)
table1 %>%
mutate(matches = pmap(list(class, type, number),
function(cl, ty, nu) filter(table2,
class == cl,
type == ty,
nu >= min_number,
nu <= max_number)$id2))
# id1 class type number matches
# 1 1324 1 A 1.0 1998
# 2 1111 1 A 2.5 1998
# 3 2324 1 A 98.0 1998
# 4 2324 2 B 100.0
# 5 1111 2 C 80.0 2012
# 6 1111 1 D 50.0 1998, 2010
For speed, you can try using the furrr
package and future_pmap
, or you can use multidplyr
.
With furrr
:
library(tidyverse)
library(furrr)
plan(multisession, workers = parallel::detectCores() - 1)
table1 %>%
mutate(matches = future_pmap(list(class, type, number),
function(cl, ty, nu) filter(table2,
class == cl,
type == ty,
nu >= min_number,
nu <= max_number)$id2))
# id1 class type number matches
# 1 1324 1 A 1.0 1998
# 2 1111 1 A 2.5 1998
# 3 2324 1 A 98.0 1998
# 4 2324 2 B 100.0
# 5 1111 2 C 80.0 2012
# 6 1111 1 D 50.0 1998, 2010