Joining Dataframes in R, Matching Patterns in Strings-CodePudding

Two big real life tables to join up, but here's a little reprex:

I've got a table of small strings and I want to left join on a second table, with the join being based on whether or not these small strings can be found inside the bigger strings on the second table.

df_1 <- data.frame(index = 1:5,
                   keyword = c("john", "ella", "mil", "nin", "billi"))

df_2 <- data.frame(index_2 = 1001:1008,
                   name = c("John Coltrane", "Ella Fitzgerald", "Miles Davis", "Billie Holliday", 
                            "Nina Simone", "Bob Smith", "John Brown", "Tony Montana"))

df_results_i_want <- data.frame(index = c(1, 1:5),
                                keyword = c("john", "john", "ella", "mil", "nin", "billi"),
                                index_2 = c(1001, 1007, 1002, 1003, 1005, 1004),
                                name    = c("John Coltrane", "John Brown", "Ella Fitzgerald", 
                                            "Miles Davis", "Nina Simone", "Billie Holliday"))

Seems like a str_detect() call and a left_join() call might be part of the solution - ie I'm hoping for something like:

library(tidyverse)

df_results <- df_1 |> left_join(df_2, join_by(blah blah str_detect() blah blah))

I'm using dplyr 1.1 so I can use join_by(), but I'm not sure of the correct way to get what I need - can anyone help please?

I suppose I could do a simple cross join using tidyr::crossing() and then do the str_detect() stuff afterwards (and filter out things that don't match)

df_results <- df_1 |>
  crossing(df_2) |>
  mutate(match = str_detect(name, fixed(keyword, ignore_case = TRUE))) |>
  filter(match) |>
  select(-match)

but in my real life example, the cross join would produce an absolutely enormous table that would overwhelm my PC.

Thank you.

CodePudding user response：

You can try fuzzy_join::regex_join():

library(fuzzyjoin)
regex_join(df_2, df_1, by=c("name"="keyword"), ignore_case=T)

Output:

  index.x            name index.y keyword
1    1001   John Coltrane       1    john
2    1002 Ella Fitzgerald       2    ella
3    1003     Miles Davis       3     mil
4    1004 Billie Holliday       5   billi
5    1005     Nina Simone       4     nin
6    1007      John Brown       1    john

CodePudding user response：

join_by does not support inexact join (but unequal), but you can use fuzzyjoin:

library(dplyr)
library(fuzzyjoin)
df_2 %>% 
  mutate(name = tolower(name)) %>% 
  fuzzy_left_join(df_1, ., by = c(keyword = "name"), 
                  match_fun = \(x, y) str_detect(y, x))

  index keyword index_2            name
1     1    john    1001   john coltrane
2     1    john    1007      john brown
3     2    ella    1002 ella fitzgerald
4     3     mil    1003     miles davis
5     4     nin    1005     nina simone
6     5   billi    1004 billie holliday

CodePudding user response：

We can use SQL to do that.

library(sqldf)

sqldf("select * from [df_1] A
  left join [df_2] B on B.name like '%' || A.keyword || '%'")

giving:

  index keyword index_2            name
1     1    john    1001   John Coltrane
2     1    john    1007      John Brown
3     2    ella    1002 Ella Fitzgerald
4     3     mil    1003     Miles Davis
5     4     nin    1005     Nina Simone
6     5   billi    1004 Billie Holliday