Join each term with list of keywords

Time:05-02

This is probably a simple problem that you can help me with quickly.

I have a vector with all the terms contained in a list of keywords. Now I want to join each term with every keyword that contains it. Here's an example:

vec <- c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat", …)
keywords <- c("small boat tour", "a house on the river", "a houseboat", …)

The expected result looks like:

              keywords     terms
       small boat tour     small
       small boat tour      boat
       small boat tour      tour
  a house on the river         a
  a house on the river     house
  a house on the river        on
  a house on the river       the
  a house on the river     river
           a houseboat         a
           a houseboat  houseboat

CodePudding user response:

You can use expand.grid to get all combinations of terms and keywords, wrap each word of vec in word boundaries (\\b), test each pair with grepl, and keep the rows that match:

df1 <- expand.grid(vec, keywords)
df1[mapply(grepl, paste0('\\b', df1$Var1, '\\b'), df1$Var2), ]

        Var1                 Var2
1      small      small boat tour
2       boat      small boat tour
5       tour      small boat tour
12     river a house on the river
13     house a house on the river
15         a a house on the river
16        on a house on the river
17       the a house on the river
24         a          a houseboat
27 houseboat          a houseboat
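
To see why the word boundaries matter, and how the result could be brought into the layout asked for in the question, here is a minimal sketch. It assumes only the nine terms and three keywords shown in the question (the elided entries are left out), and the column names terms and keywords passed to expand.grid are my own choice, not part of the answer above.

```r
# Only the entries shown in the question; the elided ones are omitted.
vec <- c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat")
keywords <- c("small boat tour", "a house on the river", "a houseboat")

# Without boundaries, "house" also matches inside "houseboat".
plain_match <- grepl("house", "a houseboat")
# With \\b on both sides, only the standalone word matches.
bounded_match <- grepl("\\bhouse\\b", "a houseboat")

# All term/keyword pairs, kept as character columns rather than factors.
df1 <- expand.grid(terms = vec, keywords = keywords, stringsAsFactors = FALSE)

# One grepl call per row: does the keyword contain the term as a whole word?
hit <- mapply(grepl, paste0("\\b", df1$terms, "\\b"), df1$keywords)

# Keep the matching rows, put keywords first, and reset the row names.
res <- df1[hit, c("keywords", "terms")]
rownames(res) <- NULL
res
```

Within each keyword the terms come out in the order of vec, not in the order in which they appear in the keyword.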

CodePudding user response:

You can use fuzzyjoin::fuzzy_inner_join with stringr::str_detect as the matching function, adding \\b word boundaries around each word in vec. Note that vec and keywords are data frames here:

vec <- data.frame(terms = c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat"))
keywords <- data.frame(keywords = c("small boat tour", "a house on the river", "a houseboat"))

fuzzyjoin::fuzzy_inner_join(keywords, vec, by = c("keywords" = "terms"), 
                            match_fun = \(x, y) stringr::str_detect(x, paste0("\\b", y, "\\b")))

output

               keywords     terms
1       small boat tour     small
2       small boat tour      boat
3       small boat tour      tour
4  a house on the river     river
5  a house on the river     house
6  a house on the river         a
7  a house on the river        on
8  a house on the river       the
9           a houseboat         a
10          a houseboat houseboat

CodePudding user response:

Another way is to split each keyword into words with strsplit and keep the words that also occur in vec with intersect.

. <- lapply(strsplit(keywords, " ", TRUE), intersect, vec)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))
#               keywords     terms
#1       small boat tour     small
#2       small boat tour      boat
#3       small boat tour      tour
#4  a house on the river         a
#5  a house on the river     house
#6  a house on the river        on
#7  a house on the river       the
#8  a house on the river     river
#9           a houseboat         a
#10          a houseboat houseboat
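
One difference from the regex-based answers is worth keeping in mind: strsplit plus intersect compares whole space-separated tokens, so punctuation attached to a word prevents a match. The sketch below illustrates this with a made-up keyword (kw is hypothetical and not part of the question's data).

```r
vec <- c("house", "river")
kw <- "a house, on the river"   # hypothetical keyword containing a comma

# Token comparison: the token "house," is not equal to "house".
by_tokens <- intersect(strsplit(kw, " ", fixed = TRUE)[[1]], vec)

# Word-boundary regex: the comma counts as a boundary, so "house" matches.
is_hit <- vapply(vec, function(v) grepl(paste0("\\b", v, "\\b"), kw), logical(1))
by_regex <- vec[is_hit]

by_tokens
by_regex
```

If the keywords are plain words separated by single spaces, as in the question, both approaches find the same terms.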