Is there an easy way of text searching using lookup tables in R?-CodePudding

I have a use case where I have lots of 'lookup tables', i.e. dataframes containing strings I am searching for in rows within a large second dataframe. I need to extract rows where a string exists within the dataframe but there may be other strings in the dataframe. I also need to extract the whole row and that of the lookup table when a match is found.

I've successfully achieved what I need via a nested for loop, but my actual dataset is massive and the lookup table will be circa 50,000 rows. So a for loop is going to be very inefficient. I have had success using dplyr::semi_join but that only works when the entries match exactly, whereas I am searching for a single word in a longer string:


fruit_lookup <- data.frame(fruit=c("banana","apple","pear","plum"), rating=c(3,4,3,5))
products <- data.frame(product_code=c("535A","535B","283G","786X","765G"), product_name=c("banana drops","apple juice","melon juice","coconut oil","strawberry jelly"))
results <- data.frame(product_code=NA, product_name=NA, fruit=NA, rating=NA)

for(i in 1:nrow(products)) {
  for(j in 1:nrow(fruit_lookup)){
    if(stringr::str_detect(products$product_name[i], fruit_lookup$fruit[j])) {
      results <- tibble::add_row(results)
      results$product_code[i] <- products$product_code[i]
      results$product_name[i] <- products$product_name[i]
      results$fruit[i] <- fruit_lookup$fruit[j]
      results$rating[i] <- fruit_lookup$rating[j]
      break
    }
    }
  }

results <- stats::na.omit(results)
print(results)

This yields the result I am wanting:

product_code product_name fruit rating

535A banana drops banana 3

535B apple juice apple 4

Any advice gratefully received and I won't be hurt if I have missed something obvious. Please feel free to critique my other coding practices, which may not be ideal!

CodePudding user response：

a direct replacement for what you have might be


library(tidyverse)

products |>
  mutate(prod_tokens = str_split(string = product_name, 
                                 pattern = " ")) |>
  unnest_longer(col = prod_tokens) |>
  inner_join(fruit_lookup, by = c("prod_tokens" = "fruit"))