I would like to join 2 table1 & table 2 ( Left on COlB & right on ColD) with maximum matched strings
Table 1
ColA | ColB |
---|---|
123 | C/O room Hanbur court vaux road |
456 | House Malveri business park |
Table 2
ColD | ColC |
---|---|
Hanbur Court | Lightroom |
Malveri park | Office |
Output Table
ColA | ColB | Colc |
---|---|---|
123 | C/O room Hanbur court vaux road | Lightroom |
456 | House Malveri business park | Office |
CodePudding user response:
With fuzzyjoin
, there are options to join based on distance
library(fuzzyjoin)
library(dplyr)
stringdist_inner_join(df1, df2, by = c(ColB = "ColD"),
max_dist = 0.5, method = "jaccard") %>%
select(-ColD)
ColA ColB ColC
1 123 C/O room Hanbur court vaux road Lightroom
2 456 House Malveri business park Office
data
df1 <- structure(list(ColA = c(123L, 456L),
ColB = c("C/O room Hanbur court vaux road",
"House Malveri business park")), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(ColD = c("Hanbur Court", "Malveri park"),
ColC = c("Lightroom",
"Office")), class = "data.frame", row.names = c(NA, -2L))
CodePudding user response:
This one is quite complicated but it does the job:
library(dplyr)
library(stringr)
library(tidyr)
# prepare df2 to get pattern for `str_detect` later
df2_new <- df2 %>%
separate_rows(ColD, sep = " ") %>%
mutate(helper = tolower(ColD))
# create pattern to match
pattern <- paste(df2_new$helper, collapse = "|")
# do the calculations
df %>%
separate_rows(ColB, sep = " ") %>%
mutate(helper = tolower(ColB),
helper1 = ifelse(str_detect(helper, pattern), 1, 0)) %>%
group_by(ColA) %>%
mutate(helper = paste(helper[helper1==1], collapse = " "),
ColB = paste(ColB, collapse = " "), .keep="unused") %>%
slice(1) %>%
right_join(df2 %>%
mutate(helper = tolower(ColD)), by="helper") %>%
select(ColA, ColB, ColC)
ColA ColB ColC
<int> <chr> <chr>
1 123 C/O room Hanbur court vaux road Lightroom
2 456 House Malveri business park Office