I have a dataframe consisting of search terms and corresponding classifications which I want to assign to a new string variable in a large data frame (each about 130 variables, 30 000 rows), based on the data in it (using a partial string match).
Unfortunately I cannot post the actual data, but I created some sample code below. The large df is represented by MainDF and the classification df by CatDF. The aim is to create the column Category in MainDF, based of data from CatDF.
So far I have come up with this solution - BUT it is not efficient and takes 32sec for one search in my actual data, which is way too long as I have about 300 search terms I need to conduct:
library(dplyr)
MainDF <- data.frame (col1 = c("Tables", "Chairs", "Computer monitors", "Lounge suite", "Computer Monitors", "Deck chairs", "Office chairs", "TV monitors", "Side tables"),
col2 = c("Wooden table","Plastic chair","LG monitor","Couch","Samsung screen","Plastic chair","Ergonomic chair", "LG monitor G234","Wooden table"))
CatDF<-data.frame(SearchTerm=c("Chair","Monitor","Screen","Table","TV"),
NewCategory=c("Tables/Chairs","Screens/Monitor","Screens/Monitor","Tables/Chairs","Screens/Monitor"))
MainDF$Category=NA
for (i in 1:nrow(CatDF)){
a <- transform(
as.data.frame(
which(matrix(grepl(CatDF$SearchTerm[i], as.matrix(MainDF[,c(1:2)]),ignore.case = TRUE), nrow = nrow(MainDF)),
arr.ind = TRUE
)))
a<-a %>% distinct(row)
MainDF[a$row,"Category"]=CatDF$NewCategory[i]
}
Is there a more efficient solution to this? I know loops are generally inefficient but I cannot think of another way of doing this.
Thank you!
CodePudding user response:
Here an example with package fuzzyjoin and function regex_left_join
:
library(fuzzyjoin)
MainDF <- data.frame (col1 = c("Tables", "Chairs", "Computer monitors",
"Lounge suite", "Computer Monitors",
"Deck chairs", "Office chairs",
"TV monitors", "Side tables"),
col2 = c("Wooden table", "Plastic chair", "LG monitor",
"Couch", "Samsung screen", "Plastic chair",
"Ergonomic chair", "LG monitor G234",
"Wooden table"))
CatDF < -data.frame(SearchTerm = c("Chair", "Monitor", "Screen", "Table","TV"),
NewCategory = c("Tables/Chairs", "Screens/Monitor",
"Screens/Monitor", "Tables/Chairs",
"Screens/Monitor"))
regex_left_join(MainDF, CatDF, by = c(col1 = "SearchTerm"), ignore_case=TRUE)
It works for almost alle examples, except one, where it should be sufficient to add just another row to CatDF
.
CodePudding user response:
Here's another solution based on fuzzyjoin
:
library(fuzzyjoin)
library(stringr)
fuzzy_join(
MainDF %>%
mutate(across(everything(), ~tolower(.))),
CatDF %>%
mutate(across(everything(), ~tolower(.))),
by = c("col1" = "SearchTerm"),
match_fun = str_detect,
mode = "left"
)
col1 col2 SearchTerm NewCategory
1 tables wooden table table tables/chairs
2 chairs plastic chair chair tables/chairs
3 computer monitors lg monitor monitor screens/monitor
4 lounge suite couch <NA> <NA>
5 computer monitors samsung screen monitor screens/monitor
6 deck chairs plastic chair chair tables/chairs
7 office chairs ergonomic chair chair tables/chairs
8 tv monitors lg monitor g234 monitor screens/monitor
9 tv monitors lg monitor g234 tv screens/monitor
10 side tables wooden table table tables/chairs