I have a list of products (500.000 ) and want to search a column to match certain keywords (5.000 ). In the end I want to have only those products/lines left which contain at least one of these searchterm. Could also be multiple ones per product. I managed to just filter down to the required data, BUT I also want to have a column where I see WHY it has been selected as this is not obvious in every case (with 5.000 searchterms). The final code should run on a standard laptop (core I5,16 GB RAM).
#Example data
SearchTerms2<-c("Lamborghini|Ferrari")
df<-data.frame(LineCode = c(1:7),
Product = c("Lamborghini Urus & Ferrari Purosangue", "Ferrari
Purosangue","Lamborghini Huracan","Skoda Rapid","Skoda Kodiaq","Audi A6 Avant","Lamborghini
Urus"))
dffilterd <- filter(df,grepl(SearchTerms2,df$Product))
#Desired outcome:
1;Lamborghini Urus & Ferrari Purosangue;Lamborghini/Ferrari
2;Ferrari Purosangue;Ferrari
3;Lamborghini Huracan;Lamborghini
7;Lamborghini Urus;Lamborghini
At the moment I can only imagine a really large table with 500.000 lines & 5.000 columns and running some function over the whole table, but I assume this will be waste of memory not the fastest solution for a problem like this.
CodePudding user response:
Here a solution that uses tokenizers
to separate words in Product. This way does not depends on grep every match. I use data.table for speed given the huge dataframe you have.
library(tokenizers)
library(data.table)
searchTerms <- c("Lamborghini","Ferrari")
df<-data.frame(LineCode = c(1:7),Product = c(
"Lamborghini Urus & Ferrari Purosangue",
"Ferrari Purosangue","Lamborghini Huracan","Skoda Rapid",
"Skoda Kodiaq","Audi A6 Avant","Lamborghini Urus"))
setDT(df)
setkey(df, LineCode)
df_tokens <- df[, .(words=tolower(unique(tokenize_words(Product, simplify = T)))),
by=LineCode]
searchTerms <- tolower(searchTerms)
df_tokens<- df_tokens[words %in% searchTerms, .(matches=paste(words, collapse="/")),
keyby = LineCode]
df_tokens <- df_tokens[df, .(LineCode, Product, matches), on = "LineCode"]
df_tokens
#> Key: <LineCode>
#> LineCode Product matches
#> <int> <char> <char>
#> 1: 1 Lamborghini Urus & Ferrari Purosangue lamborghini/ferrari
#> 2: 2 Ferrari Purosangue ferrari
#> 3: 3 Lamborghini Huracan lamborghini
#> 4: 4 Skoda Rapid <NA>
#> 5: 5 Skoda Kodiaq <NA>
#> 6: 6 Audi A6 Avant <NA>
#> 7: 7 Lamborghini Urus lamborghini