Home > Software design >  How to return indices of matching rows between two data.frames in R
How to return indices of matching rows between two data.frames in R

Time:08-24

table1 <- data.frame(id1 = c(1324, 1111, 2324, 2324, 1111, 1111),
                     class = c(1, 1, 1, 2, 2, 1),
                     type = c("A", "A", "A", "B", "C", "D"),
                     number = c(1, 2.5, 98, 100, 80, 50))
> table1
   id1 class type number
1 1324     1    A    1.0
2 1111     1    A    2.5
3 2324     1    A   98.0
4 2324     2    B  100.0
5 1111     2    C   80.0
6 1111     1    D   50.0

table2 <- data.frame(id2 = c(1992, 1987, 1998, 1998, 2000, 2010, 2012),
                     class = c(3, 3, 1, 1, 3, 1, 2),
                     type = c("B", "C", "D", "A", "D", "D", "C"),
                     min_number = c(0, 0, 34, 0, 20, 45, 5),
                     max_number = c(18, 18, 50, 100, 100, 100, 100))

> table2
   id2 class type min_number max_number
1 1992     3    B          0         18
2 1987     3    C          0         18
3 1998     1    D         34         50
4 1998     1    A          0        100
5 2000     3    D         20        100
6 2010     1    D         45        100
7 2012     2    C          5        100

I have 2 tables. For each row in table1, I want to find the rows that match it in table2 and store the corresponding id2s. That is, I want to match on class, type, and whether or not min_number <= number <= max_number.

I wrote the for loop below to do this:

id2_list <- vector("list")
for(i in 1:nrow(table1)){
  id2_list[[i]] <- table2[which((table2$class == table1$class[i]) 
                         & (table2$type %in% table1$type[i]) 
                         & ((table2$min_number <= table1$number[i]) 
                            & (table2$max_number >= table1$number[i]))), ]$id2
}


> id2_list
[[1]]
[1] 1998

[[2]]
[1] 1998

[[3]]
[1] 1998

[[4]]
numeric(0)

[[5]]
[1] 2012

[[6]]
[1] 1998 2010

This list contains the id2s that match each of the 6 id1s in table1. It's possible for a row in table1 to match 0, 1, or more rows in table2.

Is there a faster way to do this without using a for loop? If I have millions of rows in either table. I think this would be very inefficient.

CodePudding user response:

Yes, you can do this with a pmap and filter inside a mutate as such:

library(tidyverse)

table1 %>%
  mutate(matches = pmap(list(class, type, number),
                        function(cl, ty, nu) filter(table2,
                                                    class == cl,
                                                    type == ty,
                                                    nu >= min_number,
                                                    nu <= max_number)$id2))

#    id1 class type number    matches
# 1 1324     1    A    1.0       1998
# 2 1111     1    A    2.5       1998
# 3 2324     1    A   98.0       1998
# 4 2324     2    B  100.0           
# 5 1111     2    C   80.0       2012
# 6 1111     1    D   50.0 1998, 2010

For speed, you can try using the furrr package and future_pmap, or you can use multidplyr.

With furrr:

library(tidyverse)
library(furrr)

plan(multisession, workers = parallel::detectCores() - 1)

table1 %>%
  mutate(matches = future_pmap(list(class, type, number),
                        function(cl, ty, nu) filter(table2,
                                                    class == cl,
                                                     type == ty,
                                                     nu >= min_number,
                                                     nu <= max_number)$id2))
                                                                               

#    id1 class type number    matches
# 1 1324     1    A    1.0       1998
# 2 1111     1    A    2.5       1998
# 3 2324     1    A   98.0       1998
# 4 2324     2    B  100.0           
# 5 1111     2    C   80.0       2012
# 6 1111     1    D   50.0 1998, 2010
  • Related