I have df1
which I would like to merge with df2
based on a common field id
id
is always in the form of 21_2342_A_C
(i.e. num_num_char_char
). I want to merge df2
into df1
if either of the last two fields (sep="_")
in id
are switched.
So, if ID
in df1
is 21_2342_A_C
, then I want it to match if the entry in df2
is either 21_2342_A_C
or 21_2342_C_A
.
Is this possible using data.table? I've developed a cumbersome way involving creating two different columns and doing two different joins, but I was hoping there'd be a more elegant solution. I'll also happily take a non data.table solution.
CodePudding user response:
This also includes creating two additional columns but only 1 merge:
dt <- data.table(
id = c("21_2342_A_C", "21_2342_C_A", "21_2342_A_B")
)
- extract number and character part of id
- sort character part
- merge if number and character parts are same
- remove merges on itself and/or duplicated merges (if row i is merged to row j then row j is merged on row i)
dt[, row_id := seq_len(.N)]
dt[, (c("id1", "id2")) := transpose(str_extract_all(dt$id, "([0-9]{2}_[0-9]{4})|([A-Z]_[A-Z])"))]
dt[, id2 := map_chr(str_split(id2, "_"), ~str_c(sort(.x), collapse = ""))]
res <- dt[dt, on = .(id1, id2)][row_id < i.row_id]
res[, c("row_id", "id1", "id2", "i.row_id") := NULL]
CodePudding user response:
I also could not figure out how to do it without an intermediate id. Here is my take:
df1 <- data.table(V1= "hello", id= "21_2342_A_C")
df2 <- data.table(V1= c("world1", "world2"), id= c("21_2342_A_C", "21_2342_C_A"))
sort_id <- function(x)
{
x <- unlist(tstrsplit(x, "_"))
return(paste0(c(x[1:2], sort(x[3:4])), collapse= "_"))
}
df1[, id2:= sort_id(id), id]
df2[, id2:= sort_id(id), id]
merge(df1,
df2,
"id2")