Find string in other string and if succesful show searched string(s)-CodePudding

I have a list of products (500.000 ) and want to search a column to match certain keywords (5.000 ). In the end I want to have only those products/lines left which contain at least one of these searchterm. Could also be multiple ones per product. I managed to just filter down to the required data, BUT I also want to have a column where I see WHY it has been selected as this is not obvious in every case (with 5.000 searchterms). The final code should run on a standard laptop (core I5,16 GB RAM).

#Example data

SearchTerms2<-c("Lamborghini|Ferrari")

df<-data.frame(LineCode = c(1:7),
               Product = c("Lamborghini Urus & Ferrari Purosangue", "Ferrari 
Purosangue","Lamborghini Huracan","Skoda Rapid","Skoda Kodiaq","Audi A6 Avant","Lamborghini 
Urus"))

dffilterd <- filter(df,grepl(SearchTerms2,df$Product))

#Desired outcome:

1;Lamborghini Urus & Ferrari Purosangue;Lamborghini/Ferrari

2;Ferrari Purosangue;Ferrari

3;Lamborghini Huracan;Lamborghini

7;Lamborghini Urus;Lamborghini

At the moment I can only imagine a really large table with 500.000 lines & 5.000 columns and running some function over the whole table, but I assume this will be waste of memory not the fastest solution for a problem like this.

CodePudding user response：

Here a solution that uses tokenizers to separate words in Product. This way does not depends on grep every match. I use data.table for speed given the huge dataframe you have.

library(tokenizers)
library(data.table)

searchTerms <- c("Lamborghini","Ferrari")

df<-data.frame(LineCode = c(1:7),Product = c(
    "Lamborghini Urus & Ferrari Purosangue",
    "Ferrari Purosangue","Lamborghini Huracan","Skoda Rapid",
    "Skoda Kodiaq","Audi A6 Avant","Lamborghini Urus"))
setDT(df)
setkey(df, LineCode)

df_tokens <- df[, .(words=tolower(unique(tokenize_words(Product, simplify = T)))), 
                by=LineCode]
searchTerms <- tolower(searchTerms)
df_tokens<- df_tokens[words %in% searchTerms, .(matches=paste(words, collapse="/")),
                keyby = LineCode]

df_tokens <- df_tokens[df, .(LineCode, Product, matches), on = "LineCode"]

df_tokens
#> Key: <LineCode>
#>    LineCode                               Product             matches
#>       <int>                                <char>              <char>
#> 1:        1 Lamborghini Urus & Ferrari Purosangue lamborghini/ferrari
#> 2:        2                    Ferrari Purosangue             ferrari
#> 3:        3                   Lamborghini Huracan         lamborghini
#> 4:        4                           Skoda Rapid                <NA>
#> 5:        5                          Skoda Kodiaq                <NA>
#> 6:        6                         Audi A6 Avant                <NA>
#> 7:        7                      Lamborghini Urus         lamborghini