This is probably a simple problem.
I have a vector with all the terms contained in a list of keywords. Now I want to join each term with every keyword that contains it as a whole word. Here's an example:
vec <- c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat", …)
keywords <- c("small boat tour", "a house on the river", "a houseboat", …)
The expected result looks like:
keywords              terms
small boat tour       small
small boat tour       boat
small boat tour       tour
a house on the river  a
a house on the river  house
a house on the river  on
a house on the river  the
a house on the river  river
a houseboat           a
a houseboat           houseboat
CodePudding user response:
You can use expand.grid to get all combinations of vec and keywords, wrap the words of vec in word boundaries (\\b), and filter with grepl, i.e.
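# all term/keyword pairs: Var1 holds the terms, Var2 the keywords
df1 <- expand.grid(vec, keywords)
# keep the rows where the term occurs as a whole word in the keyword
df1[mapply(grepl, paste0('\\b', df1$Var1, '\\b'), df1$Var2), ]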
        Var1                 Var2
1      small      small boat tour
2       boat      small boat tour
5       tour      small boat tour
12     river a house on the river
13     house a house on the river
15         a a house on the river
16        on a house on the river
17       the a house on the river
24         a          a houseboat
27 houseboat          a houseboat
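To see why the word boundaries matter, here is a small check. It is only a sketch in base R, using strings taken from the question:

```r
# Without \\b, "house" matches inside "houseboat" and "a" matches inside "small".
grepl("house", "a houseboat")                  # TRUE
grepl("a", "small boat tour")                  # TRUE

# With \\b on both sides, only whole words match.
grepl("\\bhouse\\b", "a houseboat")            # FALSE
grepl("\\ba\\b", "small boat tour")            # FALSE
grepl("\\bhouse\\b", "a house on the river")   # TRUE
```

If a term could contain regex metacharacters such as . or (, it would need escaping before being pasted into the pattern; the terms in the question do not.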
CodePudding user response:
You can do a fuzzy join with fuzzyjoin::fuzzy_inner_join, using stringr::str_detect as the matching function and adding \\b word boundaries to each word in vec.
vec <- data.frame(terms = c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat"))
keywords <- data.frame(keywords = c("small boat tour", "a house on the river", "a houseboat"))
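# match a keyword (x) to a term (y) when y occurs in x as a whole word
# (the \(x, y) lambda shorthand needs R >= 4.1)
fuzzyjoin::fuzzy_inner_join(keywords, vec, by = c("keywords" = "terms"),
                            match_fun = \(x, y) stringr::str_detect(x, paste0("\\b", y, "\\b")))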
Output:
               keywords     terms
1       small boat tour     small
2       small boat tour      boat
3       small boat tour      tour
4  a house on the river     river
5  a house on the river     house
6  a house on the river         a
7  a house on the river        on
8  a house on the river       the
9           a houseboat         a
10          a houseboat houseboat
CodePudding user response:
One way is to use strsplit and intersect.
. <- lapply(strsplit(keywords, " ", TRUE), intersect, vec)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))
#               keywords     terms
#1       small boat tour     small
#2       small boat tour      boat
#3       small boat tour      tour
#4  a house on the river         a
#5  a house on the river     house
#6  a house on the river        on
#7  a house on the river       the
#8  a house on the river     river
#9           a houseboat         a
#10          a houseboat houseboat
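The same steps can be written with named intermediates, which makes each stage easier to inspect. This is only a sketch: the names tokens, matched and res are made up for illustration, and the data is the question's example without the elided entries.

```r
vec <- c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat")
keywords <- c("small boat tour", "a house on the river", "a houseboat")

# Step 1: split each keyword on single spaces (fixed = TRUE means a literal
# match, so no regex is involved).
tokens <- strsplit(keywords, " ", fixed = TRUE)

# Step 2: keep only the tokens that occur in vec. intersect() also drops
# duplicates, so a word repeated within one keyword is listed once.
matched <- lapply(tokens, intersect, vec)
matched[[3]]
# [1] "a"         "houseboat"

# Step 3: repeat each keyword once per matched term and flatten the list.
res <- data.frame(keywords = rep(keywords, lengths(matched)),
                  terms = unlist(matched))
```

Because the comparison is on whole tokens, "house" is never matched against "houseboat", so no word-boundary pattern is needed. The trade-off is that it assumes the words in each keyword are separated by single spaces with no punctuation attached.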