Home > Net >  How to merge (join) data frames based on string matching?
How to merge (join) data frames based on string matching?

Time:06-07

TL;DR: I want to merge two dataframes based on a column that contains multiple synonyms of an identifier.

This is my first ever question to stackoverflow but I have been stuck with this seemingly simple problem for far too long so I would be very grateful for any help!

I have two dataframes that both contain one or multiple character columns describing a row. I would like to merge these dataframes, and although sometimes they have a direct match, there is no single identifier column but rather multiple synonyms that could match.

Here is a simple reproducible example for the two data frames:

df1 <- data.frame("Name" = c("a", "b", "c"))
df2 <- data.frame("Symbol" = c("a", "two", "d"), 
                  "Synonym" = c("", "b | 2", "c-four"))

I would like the first row to match directly ("a"), the second row to match the synonym ("b") and the third row to not match ("c"). (The non-matched row ("d") may or may not be included in the output.)

Here is the dataframe I would like as an output of the merge:

goal_df <- data.frame("Name" = c("a", "b", "c", NA),
                      "Symbol" = c("a", "two", NA, "d"),
                      "Synonym" = c("", "b | 2", NA, "c-four"))

screenshot of goal dataframe

In pseudo-code, this is what I would like to do:

  1. create a new column in df1 called 'match' that will contain information on whether or not there is a match between df1 and df2 for any given row in df1

  2. compare df1$Name and df2$Symbol and populate df1$match with df2$Symbol if they are identical

  3. if they are not identical, look for an exact match in the df2$Synonym column, if it is found, assign the corresponding df2$Symbol to df1$match

  4. merge the two dataframes based on df1$match and df2$Symbol

Here is (some of the things) I've tried so far:

First, I split the synonyms into individual strings for each synonym to allow exact comparison (This creates a list type because there can be multiple synonyms per row):

df2 <- df2 %>% 
  mutate(syns = strsplit(df2$Synonym, split = "|", fixed = T)) 
df3 <- df1 %>% mutate(
  match = case_when(
    Name %in% df2$Symbol ~ Name,
    Name %in% df2$syns ~ df2$Symbol, #this line
    TRUE ~ NA_character_)) %>% 
left_join(df2, by = c("match" = "Symbol"), all.x = T)

Unfortunately, this line won't compute correctly for multiple reasons, I guess.

Alternatives I've tried for this line :

sapply(Name, grep, df2$syns) ~ df2$Symbol, #this line

sapply(paste0("\\b", Name, "\\b"), grep, df2$syns) ~ df2$Symbol, #this line

However, I get the following error:

Error in `mutate()`:
! Problem while computing `match = case_when(...)`.
Caused by error in `case_when()`:
! LHS of case 2 (`sapply(Name, grep, df2$syns)`) must be
  a logical vector, not a list.
Run `rlang::last_error()` to see where the error occurred.

Maybe it is the wrong approach to strsplit the Synonym column first? I'm not sure 1) how to match only exact matches (so that it doesn't match "c" and "c-four" for example) and 2) how to merge the two dataframes once I have compared the strings successfully (because it doesn't seem to be possible to assign a value from a different dataframe in case_when).

I hope I described my question thoroughly! Thanks in advance for any help you can offer. I feel hilariously stuck at this seemingly simple task.

CodePudding user response:

Perhaps this using fuzzyjoin:

(%>% may also be used instead of |>.)

library(tidyverse)
library(fuzzyjoin)

df1 <- data.frame("Name" = c("a", "b", "c"))
df2 <- data.frame("Symbol" = c("a", "two", "d"), 
                  "Synonym" = c("", "b | 2", "c-four"))

df2_regex <- df2 |> 
  mutate(regex = str_c(Symbol, Synonym, sep = "|") |> 
           str_remove_all(" ") |> 
           str_remove("\\|$") |> 
           str_replace_all("\\|", "$\\|^"),
         regex = str_c("^", regex, "$")
  )

df1 |> 
  regex_full_join(df2_regex, by = c(Name = "regex")) |> 
  select(-regex)
#>   Name Symbol Synonym
#> 1    a      a        
#> 2    b    two   b | 2
#> 3    c   <NA>    <NA>
#> 4 <NA>      d  c-four

Created on 2022-06-06 by the reprex package (v2.0.1)

  • Related