I can conduct a matched copy of a column in a data frame into another by something like
DF2$y <- DF1[match(DF2$id2, DF1$id1), "z"] # DF1 and DF2 are data frames
where DF2$id2
is matched to DF1$id1
. I would like to know what I can do with data tables for this kind of operation. My data tables have millions of rows and hundreds of columns. I've done setkey(DT1, id1)
and setkey(DT2, id2)
.
This works:
DT2[, y := DT1[match(DT2$id2, DT1$id1), "z"]] # DT1 and DT2 are data tables
but I am afraid that the match
part might take long than necessary. (Or is it inevitable?)
I understand that I can also use column selection, merge
, and renaming:
tmp <- DT1[, c("id1", "z")] # column selection
DT3 <- merge(DT2, tmp, by.x = "id2", by.y = "id1", all.x = TRUE, suffixes = c("", ".y")) # merge
setnames(DT3, "z.y", "y") # rename
(where the first two lines can be written on one line) but this seems a bit too complicated. Would there be a simpler and fast solution?
Thanks.
Example:
DF1 <- data.frame(id1=2:4, x=LETTERS[1:3], z=11:13)
DF2 <- data.frame(id2=1:4, x=LETTERS[5:8], z=21:24)
DF1
# id1 x z
# 1 2 A 11
# 2 3 B 12
# 3 4 C 13
DF2
# id2 x z
# 1 1 E 21
# 2 2 F 22
# 3 3 G 23
# 4 4 H 24
DT1 <- data.table(DF1)
DT2 <- data.table(DF2)
setkey(DT1, id1)
setkey(DT2, id2)
DF2$y <- DF1[match(DF2$id2, DF1$id1), "z"]
DF2 # correct
# id2 x z y
# 1 1 E 21 NA
# 2 2 F 22 11
# 3 3 G 23 12
# 4 4 H 24 13
DT2[, y := DT1[match(DT2$id2, DT1$id1), "z"]]
DT2
# id2 x z y
# 1: 1 E 21 NA
# 2: 2 F 22 11
# 3: 3 G 23 12
# 4: 4 H 24 13
DT2[, y := NULL]
tmp <- DT1[, c("id1", "z")]
DT3 <- merge(DT2, tmp, by.x = "id2", by.y = "id1", all.x = TRUE, suffixes = c("", ".y"))
setnames(DT3, "z.y", "y")
DT3
# id2 x z y
# 1: 1 E 21 NA
# 2: 2 F 22 11
# 3: 3 G 23 12
# 4: 4 H 24 13
## Simpler alternatives?
CodePudding user response:
If I understand correctly, the OP wants to append column z
from DT1
to DT2
as column y
where the id columns match.
With data.table, this can be solved using an update join:
library(data.table)
DT2[DT1, on = .(id2 = id1), y := i.z]
DT2
id2 x z y
1: 1 E 21 NA
2: 2 F 22 11
3: 3 G 23 12
4: 4 H 24 13
Note that DT2
is updated by reference, i.e., without copying the whole data object. This might be handy for OP's large production datasets of millions of rows.
CodePudding user response:
I have two possible ideas:
- Using
merge
and select the columns and do the renaming there
DT3 <- merge(DT2, DT1[, .(id1, y = z)], by.x = "id2", by.y = "id1", all.x = TRUE)
- Using
data.table
'son
syntax. I based this solution directly off of the answer here, which shows how to do a full outer join: https://stackoverflow.com/a/46904676/9244371
unique_keys <- unique(c(DT1[, id1], DT2[, id2]))
DT3 <- DT2[DT1[.(unique_keys), .(id1, y = z)]]