Copy only one variable from one R data.table to another after matching on a variable-CodePudding

I can conduct a matched copy of a column in a data frame into another by something like

DF2$y <- DF1[match(DF2$id2, DF1$id1), "z"]  # DF1 and DF2 are data frames

where DF2$id2 is matched to DF1$id1. I would like to know what I can do with data tables for this kind of operation. My data tables have millions of rows and hundreds of columns. I've done setkey(DT1, id1) and setkey(DT2, id2).

This works:

DT2[, y := DT1[match(DT2$id2, DT1$id1), "z"]]  # DT1 and DT2 are data tables

but I am afraid that the match part might take long than necessary. (Or is it inevitable?)

I understand that I can also use column selection, merge, and renaming:

tmp <- DT1[, c("id1", "z")]  # column selection
DT3 <- merge(DT2, tmp, by.x = "id2", by.y = "id1", all.x = TRUE, suffixes = c("", ".y")) # merge
setnames(DT3, "z.y", "y")  # rename

(where the first two lines can be written on one line) but this seems a bit too complicated. Would there be a simpler and fast solution?

Thanks.

Example:

DF1 <- data.frame(id1=2:4, x=LETTERS[1:3], z=11:13)
DF2 <- data.frame(id2=1:4, x=LETTERS[5:8], z=21:24)
DF1
#   id1 x  z
# 1   2 A 11
# 2   3 B 12
# 3   4 C 13
DF2
#   id2 x  z
# 1   1 E 21
# 2   2 F 22
# 3   3 G 23
# 4   4 H 24

DT1 <- data.table(DF1)
DT2 <- data.table(DF2)
setkey(DT1, id1)
setkey(DT2, id2)

DF2$y <- DF1[match(DF2$id2, DF1$id1), "z"]
DF2  # correct
#   id2 x  z  y
# 1   1 E 21 NA
# 2   2 F 22 11
# 3   3 G 23 12
# 4   4 H 24 13

DT2[, y := DT1[match(DT2$id2, DT1$id1), "z"]]
DT2
#    id2 x  z  y
# 1:   1 E 21 NA
# 2:   2 F 22 11
# 3:   3 G 23 12
# 4:   4 H 24 13
DT2[, y := NULL]

tmp <- DT1[, c("id1", "z")]
DT3 <- merge(DT2, tmp, by.x = "id2", by.y = "id1", all.x = TRUE, suffixes = c("", ".y"))
setnames(DT3, "z.y", "y")
DT3
#    id2 x  z  y
# 1:   1 E 21 NA
# 2:   2 F 22 11
# 3:   3 G 23 12
# 4:   4 H 24 13

## Simpler alternatives?

CodePudding user response：

If I understand correctly, the OP wants to append column z from DT1 to DT2 as column y where the id columns match.

With data.table, this can be solved using an update join:

library(data.table)
DT2[DT1, on = .(id2 = id1), y := i.z]
DT2

   id2 x  z  y
1:   1 E 21 NA
2:   2 F 22 11
3:   3 G 23 12
4:   4 H 24 13

Note that DT2 is updated by reference, i.e., without copying the whole data object. This might be handy for OP's large production datasets of millions of rows.

CodePudding user response：

I have two possible ideas:

Using merge and select the columns and do the renaming there

DT3 <- merge(DT2, DT1[, .(id1, y = z)], by.x = "id2", by.y = "id1", all.x = TRUE)

Using data.table's on syntax. I based this solution directly off of the answer here, which shows how to do a full outer join: https://stackoverflow.com/a/46904676/9244371

unique_keys <- unique(c(DT1[, id1], DT2[, id2]))
DT3 <- DT2[DT1[.(unique_keys), .(id1, y = z)]]