Join two dataframes on one column that contains substring of other-CodePudding

I am trying to left-join df2 onto df1.

df1 is my dataframe of interest, df2 contains additional information I need.

Example:

#df of interest onto which the other should be joined
key1 <- c("London", "Paris", "Berlin", "Delhi") 
other_stuff <- c("Tea", "Coffee", "Beer", "Tea") 
df1 <- data.frame(key1, other_stuff)

#additional info df
key2 <- c("London and other cities", "some other city", "Eastberlin is history", "Berlin", "Delia is a name", "Delhi is a place") 
more_info <- c("history", "languages", "trades", "art", "commerce", "manufacturing")
df2 <- data.frame(key2,more_info)

What I now want is that df2$key2 is searched for the precise occurrence of df1$key1 and then merged onto df1 (e.g. match Berlin to Berlin, but not to Eastberlin, and Delhi to Delhi but not to Delia) while ignoring the other words surrounding the match.

Desired outcome:

key1	other_stuff	more_info
London	Tea	history
Paris	Coffee	NA
Berlin	Beer	art
Delhi	Tea	manufacturing

I tried variations of regex_left_join joined<- regex_left_join(df1,df2, by = c("key1" = "key2"), ignore_case= F)

and fuzzyjoins joined<- df1%>% fuzzy_left_join(df2, by = c("key1" = "key2"), match_fun = str_detect)

They both only give a result for the exact match (key1=key2=Berlin) and give NA for everything else.

How do I do this?

I also tried Merging two tables where one column is substring of the other in R but the logic in the SQL there is the wrong way around. I tried several other Stackexchange approaches, but they are "too fuzzy" for my data.

CodePudding user response：

Here I use the "regular" dplyr::left_join, but performed some selection in df2 when joining it with df1.

First create a vector that contains your target city. Then I will split df2$key2 by white space, and see if there is any word that matches the strings in the vector city. Then left_join it with df1.

library(tidyverse)

city <- c("London", "Paris", "Berlin", "Delhi")

left_join(df1,
          df2 %>% mutate(city = sapply(strsplit(df2$key2, " "), 
                                       function(x) first(intersect(city, x)))),
          by = c("key1" = "city")) %>% 
  select(-key2)

    key1 other_stuff     more_info
1 London         Tea       history
2  Paris      Coffee          <NA>
3 Berlin        Beer           art
4  Delhi         Tea manufacturing

CodePudding user response：

The following works with the posted data examples but it uses two joins and is probably ineffective for larger data sets.

library(dplyr)
library(fuzzyjoin)

left_join(
  df1,
  regex_left_join(df2, df1, by = c(key2 = "key1"))[c(3, 4, 2)] |> na.omit()
)
#> Joining, by = c("key1", "other_stuff")
#>     key1 other_stuff     more_info
#> 1 London         Tea       history
#> 2  Paris      Coffee          <NA>
#> 3 Berlin        Beer           art
#> 4  Delhi         Tea manufacturing

^{Created on 2022-02-16 by the reprex package (v2.0.1)}

CodePudding user response：

You don't get the expected result because those functions passes the second dataframe as the regex pattern, so instead you could use regex_right_join or fuzzy_right_join:

df1 %>% 
  regex_right_join(df2, ., by = c(key2 = "key1")) %>% 
  select(key1, other_stuff, more_info)

df1 %>% 
  fuzzy_right_join(df2, ., by = c(key2 = "key1"), match_fun = str_detect) %>% 
  select(key1, other_stuff, more_info)

output

    key1 other_stuff     more_info
1 London         Tea       history
2  Paris      Coffee          <NA>
3 Berlin        Beer           art
4  Delhi         Tea manufacturing