data.table join based on switched string combinations-CodePudding

I have df1 which I would like to merge with df2 based on a common field id

id is always in the form of 21_2342_A_C (i.e. num_num_char_char). I want to merge df2 into df1 if either of the last two fields (sep="_") in id are switched.

So, if ID in df1 is 21_2342_A_C, then I want it to match if the entry in df2 is either 21_2342_A_C or 21_2342_C_A.

Is this possible using data.table? I've developed a cumbersome way involving creating two different columns and doing two different joins, but I was hoping there'd be a more elegant solution. I'll also happily take a non data.table solution.

CodePudding user response：

This also includes creating two additional columns but only 1 merge:

dt <- data.table(
  id = c("21_2342_A_C", "21_2342_C_A", "21_2342_A_B")
)

extract number and character part of id
sort character part
merge if number and character parts are same
remove merges on itself and/or duplicated merges (if row i is merged to row j then row j is merged on row i)

dt[, row_id := seq_len(.N)]
dt[, (c("id1", "id2")) := transpose(str_extract_all(dt$id, "([0-9]{2}_[0-9]{4})|([A-Z]_[A-Z])"))]
dt[, id2 := map_chr(str_split(id2, "_"), ~str_c(sort(.x), collapse = ""))]
res <- dt[dt, on = .(id1, id2)][row_id < i.row_id]

res[, c("row_id", "id1", "id2", "i.row_id") := NULL]

CodePudding user response：

I also could not figure out how to do it without an intermediate id. Here is my take:

df1 <- data.table(V1= "hello", id= "21_2342_A_C")
df2 <- data.table(V1= c("world1", "world2"), id= c("21_2342_A_C", "21_2342_C_A"))

sort_id <- function(x)
{
  x <- unlist(tstrsplit(x, "_"))
  return(paste0(c(x[1:2], sort(x[3:4])), collapse= "_"))
}

df1[, id2:= sort_id(id), id]
df2[, id2:= sort_id(id), id]

merge(df1, 
      df2, 
      "id2")