I have a use case where I have lots of 'lookup tables', i.e. dataframes containing strings I am searching for in rows within a large second dataframe. I need to extract rows where a string exists within the dataframe but there may be other strings in the dataframe. I also need to extract the whole row and that of the lookup table when a match is found.
I've successfully achieved what I need via a nested for loop, but my actual dataset is massive and the lookup table will be circa 50,000 rows. So a for loop is going to be very inefficient. I have had success using dplyr::semi_join but that only works when the entries match exactly, whereas I am searching for a single word in a longer string:
fruit_lookup <- data.frame(fruit=c("banana","apple","pear","plum"), rating=c(3,4,3,5))
products <- data.frame(product_code=c("535A","535B","283G","786X","765G"), product_name=c("banana drops","apple juice","melon juice","coconut oil","strawberry jelly"))
results <- data.frame(product_code=NA, product_name=NA, fruit=NA, rating=NA)
for(i in 1:nrow(products)) {
for(j in 1:nrow(fruit_lookup)){
if(stringr::str_detect(products$product_name[i], fruit_lookup$fruit[j])) {
results <- tibble::add_row(results)
results$product_code[i] <- products$product_code[i]
results$product_name[i] <- products$product_name[i]
results$fruit[i] <- fruit_lookup$fruit[j]
results$rating[i] <- fruit_lookup$rating[j]
break
}
}
}
results <- stats::na.omit(results)
print(results)
This yields the result I am wanting:
product_code product_name fruit rating
535A banana drops banana 3
535B apple juice apple 4
Any advice gratefully received and I won't be hurt if I have missed something obvious. Please feel free to critique my other coding practices, which may not be ideal!
CodePudding user response:
a direct replacement for what you have might be
library(tidyverse)
products |>
mutate(prod_tokens = str_split(string = product_name,
pattern = " ")) |>
unnest_longer(col = prod_tokens) |>
inner_join(fruit_lookup, by = c("prod_tokens" = "fruit"))