I am trying to left-join df2
onto df1
.
df1
is my dataframe of interest, df2
contains additional information I need.
Example:
#df of interest onto which the other should be joined
key1 <- c("London", "Paris", "Berlin", "Delhi")
other_stuff <- c("Tea", "Coffee", "Beer", "Tea")
df1 <- data.frame(key1, other_stuff)
#additional info df
key2 <- c("London and other cities", "some other city", "Eastberlin is history", "Berlin", "Delia is a name", "Delhi is a place")
more_info <- c("history", "languages", "trades", "art", "commerce", "manufacturing")
df2 <- data.frame(key2,more_info)
What I now want is that df2$key2
is searched for the precise occurrence of df1$key1
and then merged onto df1
(e.g. match Berlin to Berlin, but not to Eastberlin, and Delhi to Delhi but not to Delia) while ignoring the other words surrounding the match.
Desired outcome:
key1 | other_stuff | more_info |
---|---|---|
London | Tea | history |
Paris | Coffee | NA |
Berlin | Beer | art |
Delhi | Tea | manufacturing |
I tried variations of regex_left_join
joined<- regex_left_join(df1,df2, by = c("key1" = "key2"), ignore_case= F)
and fuzzyjoins
joined<- df1%>% fuzzy_left_join(df2, by = c("key1" = "key2"), match_fun = str_detect)
They both only give a result for the exact match (key1=key2=Berlin) and give NA for everything else.
How do I do this?
I also tried Merging two tables where one column is substring of the other in R but the logic in the SQL there is the wrong way around. I tried several other Stackexchange approaches, but they are "too fuzzy" for my data.
CodePudding user response:
Here I use the "regular" dplyr::left_join
, but performed some selection in df2
when joining it with df1
.
First create a vector that contains your target city. Then I will split df2$key2
by white space, and see if there is any word that matches the strings in the vector city
. Then left_join
it with df1
.
library(tidyverse)
city <- c("London", "Paris", "Berlin", "Delhi")
left_join(df1,
df2 %>% mutate(city = sapply(strsplit(df2$key2, " "),
function(x) first(intersect(city, x)))),
by = c("key1" = "city")) %>%
select(-key2)
key1 other_stuff more_info
1 London Tea history
2 Paris Coffee <NA>
3 Berlin Beer art
4 Delhi Tea manufacturing
CodePudding user response:
The following works with the posted data examples but it uses two joins and is probably ineffective for larger data sets.
library(dplyr)
library(fuzzyjoin)
left_join(
df1,
regex_left_join(df2, df1, by = c(key2 = "key1"))[c(3, 4, 2)] |> na.omit()
)
#> Joining, by = c("key1", "other_stuff")
#> key1 other_stuff more_info
#> 1 London Tea history
#> 2 Paris Coffee <NA>
#> 3 Berlin Beer art
#> 4 Delhi Tea manufacturing
Created on 2022-02-16 by the reprex package (v2.0.1)
CodePudding user response:
You don't get the expected result because those functions passes the second dataframe as the regex pattern, so instead you could use regex_right_join
or fuzzy_right_join
:
df1 %>%
regex_right_join(df2, ., by = c(key2 = "key1")) %>%
select(key1, other_stuff, more_info)
df1 %>%
fuzzy_right_join(df2, ., by = c(key2 = "key1"), match_fun = str_detect) %>%
select(key1, other_stuff, more_info)
output
key1 other_stuff more_info
1 London Tea history
2 Paris Coffee <NA>
3 Berlin Beer art
4 Delhi Tea manufacturing