Home > other >  How to join tables using setDT in R
How to join tables using setDT in R

Time:09-01

library(data.table)
table1 <- data.frame(id1 = c(1324, 7822, 2324, 29, 9999, 1010),
                     class = c(1, 1, 1, 2, 2, 1),
                     type = c("A", "A", "A", "B", "C", "D"),
                     number = c(1, 2.5, 98, 100, 80, 50))
table2 <- data.frame(id2 = c(1992, 1987, 1998, 1998, 2000, 2000, 2000, 2010, 2012),
                     class = c(3, 3, 1, 1, 3, 1, 2, 5, 1),
                     type = c("B", "C", "D", "A", "D", "D", "C", "B", "A"),
                     min_number = c(0, 0, 34, 0, 20, 45, 5, 23, 1),
                     max_number = c(18, 18, 50, 100, 100, 100, 100, 9, 10))

I have 2 tables that look like this:

> table1
   id1 class type number
1 1324     1    A    1.0
2 7822     1    A    2.5
3 2324     1    A   98.0
4   29     2    B  100.0
5 9999     2    C   80.0
6 1010     1    D   50.0

> table2
   id2 class type min_number max_number
1 1992     3    B          0         18
2 1987     3    C          0         18
3 1998     1    D         34         50
4 1998     1    A          0        100
5 2000     3    D         20        100
6 2000     1    D         45        100
7 2000     2    C          5        100
8 2010     5    B         23          9
9 2012     1    A          1         10

I would like to merge the two tables based on the same class and type. In particular, I would like to extract the id2s from table2 and store them in tab1_tab2.

tab1_tab2 <- setDT(table1)[setDT(table2), on = c("class", "type"), 
           id2 := id2]

> tab1_tab2
    id1 class type number  id2
1: 1324     1    A    1.0 2012
2: 7822     1    A    2.5 2012
3: 2324     1    A   98.0 2012
4:   29     2    B  100.0   NA
5: 9999     2    C   80.0 2000
6: 1010     1    D   50.0 2000

Notice that for class=1 and type=A, there are actually 2 id2s: 1998 and 2012. However, tab1_tab2 only stored 2012.

I would like the final output to store all possible id2s, each separated by ;. It should look something like this:

> final_tab1_tab2
    id1 class type number       id2
1: 1324     1    A    1.0 1998;2012
2: 7822     1    A    2.5 1998;2012
3: 2324     1    A   98.0 1998;2012
4:   29     2    B  100.0        NA
5: 9999     2    C   80.0      2000
6: 1010     1    D   50.0 1998;2000

This is a toy example, so the tables are small. In reality, my tables have thousands of rows, so I want to avoid using a for loop.

CodePudding user response:

Another option:

setDT(table2)[setDT(table1),.(class,type,number,id1,id2=x.id2),on=("class", "type")][
              ,.(id2=.(id2)),by=.(class,type,id1,number)]

# class   type   id1 number       id2
# <num> <char> <num>  <num>    <list>
#   1:     1      A  1324    1.0 1998,2012
# 2:     1      A  7822    2.5 1998,2012
# 3:     1      A  2324   98.0 1998,2012
# 4:     2      B    29  100.0        NA
# 5:     2      C  9999   80.0      2000
# 6:     1      D  1010   50.0 1998,2000

CodePudding user response:

You can do

table1[, id2 := table2[, .(id2 = paste(id2, collapse = ";")), by = .(class, type)][table1, on = .(class, type)]$id2]
#               <-- 1) aggreagtion                                             -->
#               <-- 2) left-join                                                                            -->
#                                                                                                              3) extract
#        <-- 4) assignment by reference                                                                          -->

Result

table1
#    id1 class type number       id2
#1: 1324     1    A    1.0 1998;2012
#2: 7822     1    A    2.5 1998;2012
#3: 2324     1    A   98.0 1998;2012
#4:   29     2    B  100.0      <NA>
#5: 9999     2    C   80.0      2000
#6: 1010     1    D   50.0 1998;2000

Above we

  1. aggregate table2, then
  2. perform a left-join,
  3. extract new id2 from the result of 2.
  4. and assign id2 by reference to table1

data

library(data.table)
table1 <- data.table(id1 = c(1324, 7822, 2324, 29, 9999, 1010),
                     class = c(1, 1, 1, 2, 2, 1),
                     type = c("A", "A", "A", "B", "C", "D"),
                     number = c(1, 2.5, 98, 100, 80, 50))
table2 <- data.table(id2 = c(1992, 1987, 1998, 1998, 2000, 2000, 2000, 2010, 2012),
                     class = c(3, 3, 1, 1, 3, 1, 2, 5, 1),
                     type = c("B", "C", "D", "A", "D", "D", "C", "B", "A"),
                     min_number = c(0, 0, 34, 0, 20, 45, 5, 23, 1),
                     max_number = c(18, 18, 50, 100, 100, 100, 100, 9, 10))
  • Related