Home > Software engineering >  Categorize rows in dataframe (with string variable) based on string data in another dataframe
Categorize rows in dataframe (with string variable) based on string data in another dataframe

Time:10-25

I have a dataframe consisting of search terms and corresponding classifications which I want to assign to a new string variable in a large data frame (each about 130 variables, 30 000 rows), based on the data in it (using a partial string match).

Unfortunately I cannot post the actual data, but I created some sample code below. The large df is represented by MainDF and the classification df by CatDF. The aim is to create the column Category in MainDF, based of data from CatDF.

So far I have come up with this solution - BUT it is not efficient and takes 32sec for one search in my actual data, which is way too long as I have about 300 search terms I need to conduct:

library(dplyr)

MainDF <- data.frame (col1  = c("Tables", "Chairs", "Computer monitors", "Lounge suite", "Computer Monitors", "Deck chairs", "Office chairs", "TV monitors", "Side tables"),
                      col2 = c("Wooden table","Plastic chair","LG monitor","Couch","Samsung screen","Plastic chair","Ergonomic chair", "LG monitor G234","Wooden table"))

CatDF<-data.frame(SearchTerm=c("Chair","Monitor","Screen","Table","TV"),
                  NewCategory=c("Tables/Chairs","Screens/Monitor","Screens/Monitor","Tables/Chairs","Screens/Monitor"))


MainDF$Category=NA
for (i in 1:nrow(CatDF)){
  a <- transform(
    as.data.frame(
      which(matrix(grepl(CatDF$SearchTerm[i], as.matrix(MainDF[,c(1:2)]),ignore.case = TRUE), nrow = nrow(MainDF)),
            arr.ind = TRUE
      )))
  a<-a %>% distinct(row)
  MainDF[a$row,"Category"]=CatDF$NewCategory[i]
}

Is there a more efficient solution to this? I know loops are generally inefficient but I cannot think of another way of doing this.

Thank you!

CodePudding user response:

Here an example with package fuzzyjoin and function regex_left_join:


library(fuzzyjoin)

MainDF <- data.frame (col1  = c("Tables", "Chairs", "Computer monitors", 
                                "Lounge suite", "Computer Monitors", 
                                "Deck chairs", "Office chairs", 
                                "TV monitors", "Side tables"),
                      col2 = c("Wooden table", "Plastic chair", "LG monitor",
                               "Couch", "Samsung screen", "Plastic chair",
                               "Ergonomic chair", "LG monitor G234",
                               "Wooden table"))

CatDF < -data.frame(SearchTerm =  c("Chair", "Monitor", "Screen", "Table","TV"),
                    NewCategory = c("Tables/Chairs", "Screens/Monitor",
                                    "Screens/Monitor", "Tables/Chairs",
                                    "Screens/Monitor"))

regex_left_join(MainDF, CatDF, by = c(col1  = "SearchTerm"), ignore_case=TRUE)

It works for almost alle examples, except one, where it should be sufficient to add just another row to CatDF.

CodePudding user response:

Here's another solution based on fuzzyjoin:

library(fuzzyjoin)
library(stringr)

fuzzy_join(
  MainDF %>%
    mutate(across(everything(), ~tolower(.))), 
  CatDF %>%
    mutate(across(everything(), ~tolower(.))),
  by  = c("col1" = "SearchTerm"),
  match_fun = str_detect,
  mode = "left"
) 
                col1            col2 SearchTerm     NewCategory
1             tables    wooden table      table   tables/chairs
2             chairs   plastic chair      chair   tables/chairs
3  computer monitors      lg monitor    monitor screens/monitor
4       lounge suite           couch       <NA>            <NA>
5  computer monitors  samsung screen    monitor screens/monitor
6        deck chairs   plastic chair      chair   tables/chairs
7      office chairs ergonomic chair      chair   tables/chairs
8        tv monitors lg monitor g234    monitor screens/monitor
9        tv monitors lg monitor g234         tv screens/monitor
10       side tables    wooden table      table   tables/chairs
  • Related