Home > Mobile >  Fuzzy matching gives unmatched data
Fuzzy matching gives unmatched data

Time:04-28

Hi I want to merge df1 and df2 based on x.

df1<-tibble(x=c("TRP OVERSEAS STOCK |","PIMCO TOTAL RETURN FUND"),y=c(1,2))

df2<-tibble(x=c("AB Portfolios: AB Growth Fund; Class K Shares","PIMCO TOTAL RETURN FUND"),z=c(2020,2021))

However, when I use

fuzzy_join(df1, df2, match_fun = function(x,y) str_detect(y, x), by = "x")

It gives me the following output,

enter image description here

"TRP OVERSEAS STOCK |" should not be matched with anything. But they are listed in the output. I'm wondering why does that happen and how to solve this issue? Thank you!!

CodePudding user response:

The issue is with |, as it is being used as an "or". So, it is just joining to the other rows. However, we could remove it first, then still use fuzzyjoin.

library(fuzzyjoin)
library(tidyverse)

df1 %>% 
  mutate(x = str_replace_all(x, "[|]", "")) %>% 
  fuzzy_join(., df2, match_fun = function(x,y) str_detect(y, x), by = "x")

Or if you don't want to load more packages, then we could make the change in base R first.

df1$x <- gsub("[|]", "", df1$x)

fuzzy_join(df1, df2, match_fun = function(x,y) str_detect(y, x), by = "x")

Output

  x.x                         y x.y                         z
  <chr>                   <dbl> <chr>                   <dbl>
1 PIMCO TOTAL RETURN FUND     2 PIMCO TOTAL RETURN FUND  2021

CodePudding user response:

You have to decide whether to fuzzyjoin or not. If fuzzyjoin follow the answer by @AndrewGB and also Fuzzy matching two data frames your last question.

And if not fuzzyjoin then:

df2 %>% inner_join(df1, by = "x")
  x                           z     y
  <chr>                   <dbl> <dbl>
1 PIMCO TOTAL RETURN FUND  2021     2
  • Related