Hi I want to merge df1 and df2 based on x.
df1<-tibble(x=c("TRP OVERSEAS STOCK |","PIMCO TOTAL RETURN FUND"),y=c(1,2))
df2<-tibble(x=c("AB Portfolios: AB Growth Fund; Class K Shares","PIMCO TOTAL RETURN FUND"),z=c(2020,2021))
However, when I use
fuzzy_join(df1, df2, match_fun = function(x,y) str_detect(y, x), by = "x")
It gives me the following output,
"TRP OVERSEAS STOCK |"
should not be matched with anything. But they are listed in the output. I'm wondering why does that happen and how to solve this issue? Thank you!!
CodePudding user response:
The issue is with |
, as it is being used as an "or". So, it is just joining to the other rows. However, we could remove it first, then still use fuzzyjoin
.
library(fuzzyjoin)
library(tidyverse)
df1 %>%
mutate(x = str_replace_all(x, "[|]", "")) %>%
fuzzy_join(., df2, match_fun = function(x,y) str_detect(y, x), by = "x")
Or if you don't want to load more packages, then we could make the change in base R first.
df1$x <- gsub("[|]", "", df1$x)
fuzzy_join(df1, df2, match_fun = function(x,y) str_detect(y, x), by = "x")
Output
x.x y x.y z
<chr> <dbl> <chr> <dbl>
1 PIMCO TOTAL RETURN FUND 2 PIMCO TOTAL RETURN FUND 2021
CodePudding user response:
You have to decide whether to fuzzyjoin or not. If fuzzyjoin follow the answer by @AndrewGB and also Fuzzy matching two data frames your last question.
And if not fuzzyjoin then:
df2 %>% inner_join(df1, by = "x")
x z y
<chr> <dbl> <dbl>
1 PIMCO TOTAL RETURN FUND 2021 2