How to quickly count number of unique entries after merging tables

library(data.table)
table1 <- data.table(id1 = c(1324, 2324, 29, 29, 1010, 1010),
                     type = c(1, 1, 2, 1, 1, 1),
                     class = c("A",  "A", "B", "D", "D", "A"),
                     number = c(1, 98, 100, 100, 70, 70))
table2 <- data.table(id2 = c(1998, 1998, 2000, 2000, 2000, 2010, 2012, 2012),
                     type = c(1, 1, 3, 1, 1, 5, 1, 1),
                     class = c("D", "A", "D", "D", "A", "B", "A", "A"),
                     min_number = c(34, 0, 20, 45, 5, 23, 1, 1),
                     max_number = c(50, 100, 100, 100, 100, 9, 10, 100))

> table1
    id1 type class number
1: 1324    1     A      1
2: 2324    1     A     98
3:   29    2     B    100
4:   29    1     D    100
5: 1010    1     D     70
6: 1010    1     A     70
> table2
    id2 type class min_number max_number
1: 1998    1     D         34         50
2: 1998    1     A          0        100
3: 2000    3     D         20        100
4: 2000    1     D         45        100
5: 2000    1     A          5        100
6: 2010    5     B         23          9
7: 2012    1     A          1         10
8: 2012    1     A          1        100

I have two tables that I would like to merge on type, class, and whether number lies between min_number and max_number. I then want a new variable, nMatch, storing the number of unique id2 values that match each id1.

# Row-by-row approach: for each row of table1, run a non-equi join against
# table2 and store the matching id2 values as a list column.
setindexv(table2, c("type", "class"))
for (t1_row in seq_len(nrow(table1))) {
  print(t1_row)  # progress indicator
  set(
    table1, t1_row, "matches",
    table2[table1[t1_row],
           on = c("type", "class", "max_number >= number", "min_number <= number"),
           .(list(id2))]
  )
}
> table1[, .(nMatch = uniqueN(unlist(matches), na.rm = TRUE)), by = .(id1)]
    id1 nMatch
1: 1324      2
2: 2324      3
3:   29      1
4: 1010      3

The approach above works row by row, as suggested here, but my real dataset has millions of rows. What's a faster way of doing this?

CodePudding user response:

You can do this with a single data.table non-equi join, using on = .(...) together with by = .EACHI:

na.omit(                   # drop table2 rows with no match (id1 is NA there)
  table1[
    table2,
    .(id1, id2),           # keep only the id pair for each match
    on = .(type, class, number >= min_number, number <= max_number),
    by = .EACHI            # evaluate j once per row of table2
  ]
)[
  ,
  .(nMatch = uniqueN(id2)),  # count distinct id2 per id1
  id1
]

which gives

    id1 nMatch
1: 1324      2
2: 1010      3
3: 2324      3
4:   29      1
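
Note that na.omit() removes any id1 with no match at all, so such ids disappear from the result. A minimal sketch of the same join written in the other direction, which keeps zero-match ids with nMatch = 0 (same tables as above; uniqueN()'s na.rm = TRUE treats the NA id2 of unmatched rows as zero matches):

# table2[table1] keeps every row of table1; rows without a match get id2 = NA,
# which na.rm = TRUE then excludes from the distinct count.
table2[
  table1,
  .(id1, id2),
  on = .(type, class, max_number >= number, min_number <= number)
][
  ,
  .(nMatch = uniqueN(id2, na.rm = TRUE)),
  id1
]

Either way, a single vectorized join avoids the per-row join overhead of the loop, which is what matters at millions of rows.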

CodePudding user response:

An option with the tidyverse:

library(dplyr)
library(tidyr)
left_join(table1, table2, by =
   join_by(type, class, number >= min_number, number <= max_number)) %>%
  distinct(id1, id2) %>%        # one row per matched (id1, id2) pair
  drop_na() %>%                 # remove id1 rows that matched no id2
  count(id1, name = "nMatch")

Output:

     id1 nMatch
   <num>  <int>
1:    29      1
2:  1010      3
3:  1324      2
4:  2324      3
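
An equivalent variant (a sketch using the same join_by() join) that counts with n_distinct(..., na.rm = TRUE) instead of distinct() plus drop_na(), and so keeps ids with zero matches:

left_join(table1, table2, by =
   join_by(type, class, number >= min_number, number <= max_number)) %>%
  group_by(id1) %>%
  summarise(nMatch = n_distinct(id2, na.rm = TRUE))  # NA id2 counts as zero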