Two big real life tables to join up, but here's a little reprex:
I've got a table of small strings and I want to left join on a second table, with the join being based on whether or not these small strings can be found inside the bigger strings on the second table.
df_1 <- data.frame(index = 1:5,
keyword = c("john", "ella", "mil", "nin", "billi"))
df_2 <- data.frame(index_2 = 1001:1008,
name = c("John Coltrane", "Ella Fitzgerald", "Miles Davis", "Billie Holliday",
"Nina Simone", "Bob Smith", "John Brown", "Tony Montana"))
df_results_i_want <- data.frame(index = c(1, 1:5),
keyword = c("john", "john", "ella", "mil", "nin", "billi"),
index_2 = c(1001, 1007, 1002, 1003, 1005, 1004),
name = c("John Coltrane", "John Brown", "Ella Fitzgerald",
"Miles Davis", "Nina Simone", "Billie Holliday"))
Seems like a str_detect()
call and a left_join()
call might be part of the solution - ie I'm hoping for something like:
library(tidyverse)
df_results <- df_1 |> left_join(df_2, join_by(blah blah str_detect() blah blah))
I'm using dplyr 1.1 so I can use join_by()
, but I'm not sure of the correct way to get what I need - can anyone help please?
I suppose I could do a simple cross join using tidyr::crossing()
and then do the str_detect()
stuff afterwards (and filter out things that don't match)
df_results <- df_1 |>
crossing(df_2) |>
mutate(match = str_detect(name, fixed(keyword, ignore_case = TRUE))) |>
filter(match) |>
select(-match)
but in my real life example, the cross join would produce an absolutely enormous table that would overwhelm my PC.
Thank you.
CodePudding user response:
You can try fuzzy_join::regex_join()
:
library(fuzzyjoin)
regex_join(df_2, df_1, by=c("name"="keyword"), ignore_case=T)
Output:
index.x name index.y keyword
1 1001 John Coltrane 1 john
2 1002 Ella Fitzgerald 2 ella
3 1003 Miles Davis 3 mil
4 1004 Billie Holliday 5 billi
5 1005 Nina Simone 4 nin
6 1007 John Brown 1 john
CodePudding user response:
join_by
does not support inexact join (but unequal), but you can use fuzzyjoin
:
library(dplyr)
library(fuzzyjoin)
df_2 %>%
mutate(name = tolower(name)) %>%
fuzzy_left_join(df_1, ., by = c(keyword = "name"),
match_fun = \(x, y) str_detect(y, x))
index keyword index_2 name
1 1 john 1001 john coltrane
2 1 john 1007 john brown
3 2 ella 1002 ella fitzgerald
4 3 mil 1003 miles davis
5 4 nin 1005 nina simone
6 5 billi 1004 billie holliday
CodePudding user response:
We can use SQL to do that.
library(sqldf)
sqldf("select * from [df_1] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
giving:
index keyword index_2 name
1 1 john 1001 John Coltrane
2 1 john 1007 John Brown
3 2 ella 1002 Ella Fitzgerald
4 3 mil 1003 Miles Davis
5 4 nin 1005 Nina Simone
6 5 billi 1004 Billie Holliday