R categorize messy data through string matching

Time:07-29

I am trying to categorize messy data in a data frame with four columns:

"company_name": the messy data I want to categorize
"categories": the list of categories I want to put the messy data in
"search": the keywords I want to search for in the messy data
"company_type": the column that should hold the correct company type for each row

     "company_name"           "categories"      "search"    "company_type" 

  John landscaping           Landscaping         lawn          NA  
  constructing pros          Landscaping         landscape     NA 
  Brother Lawn care          Cleaning            clean         NA 
  Top cleaning               Painting            paint         NA     
  Emmas paints               Construction        construct     NA

I want my final result to look like this:

     "company_name"           "categories"      "search"    "company_type" 

  John landscaping          Landscaping         lawn          Landscaping 
  constructing pros         Landscaping         landscape     Construction 
  Brother Lawn care         Cleaning            clean         Landscaping 
  Top cleaning              Painting            paint         Cleaning     
  Emmas paints              Construction        construct     Painting

Does anyone know how I should proceed?

Thank you!

CodePudding user response:

This answer is likely to be incomplete because it's not clear exactly how your company_type is decided. My best guess is that it's based on whether company_name contains related terms. The patterns will have to change with the data you put in, and I imagine some company names will be ambiguous, for example one called "Lawn Construction" (case_when() assigns the first condition that matches). Still, with the simple data frame given, I would just use case_when() from the dplyr package, like so:

library(dplyr)

# Sample data from the question
dat <- data.frame(company_name = c("John landscaping", "constructing pros", "Brother Lawn care", "Top cleaning", "Emmas paints"),
                  categories = c("Landscaping", "Landscaping", "Cleaning", "Painting", "Construction"),
                  search = c("lawn", "landscape", "clean", "paint", "construct"),
                  company_type = rep(NA, 5))

# grepl() is case-sensitive by default, so "Lawn" is written with a capital L.
# Rows that match none of the patterns are left as NA.
dat %>%
  mutate(company_type = case_when(grepl("landscaping|Lawn", company_name) ~ "Landscaping",
                                  grepl("constructing", company_name) ~ "Construction",
                                  grepl("cleaning", company_name) ~ "Cleaning",
                                  grepl("paints", company_name) ~ "Painting"))

Output:

       company_name   categories    search company_type
1  John landscaping  Landscaping      lawn  Landscaping
2 constructing pros  Landscaping landscape Construction
3 Brother Lawn care     Cleaning     clean  Landscaping
4      Top cleaning     Painting     paint     Cleaning
5      Emmas paints Construction construct     Painting
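
If the "search" and "categories" columns are meant to act as a keyword-to-category lookup (that is my assumption, the question doesn't say so), the patterns don't have to be hard-coded. Here is a minimal base R sketch of that idea. classify_names() is a hypothetical helper, not a function from any package, and I shortened the keyword "landscape" to the stem "landscap" because the original keyword does not occur in "John landscaping".

```r
# Lookup table built from the question's "search" and "categories" columns.
# Assumption: "landscape" is shortened to the stem "landscap" so that it
# also matches "landscaping"; the original keyword would not.
lookup <- data.frame(
  search   = c("lawn", "landscap", "clean", "paint", "construct"),
  category = c("Landscaping", "Landscaping", "Cleaning", "Painting", "Construction"),
  stringsAsFactors = FALSE
)

company_name <- c("John landscaping", "constructing pros", "Brother Lawn care",
                  "Top cleaning", "Emmas paints")

# classify_names() is a hypothetical helper for illustration only.
# For each name it returns the category of the first keyword (in lookup order)
# that occurs as a case-insensitive literal substring, or NA if none occurs.
classify_names <- function(names, lookup) {
  vapply(names, function(nm) {
    found <- vapply(lookup$search, function(kw) {
      grepl(tolower(kw), tolower(nm), fixed = TRUE)
    }, logical(1))
    hit <- which(found)
    if (length(hit) == 0) NA_character_ else lookup$category[hit[1]]
  }, character(1), USE.NAMES = FALSE)
}

company_type <- classify_names(company_name, lookup)
```

With this lookup, company_type matches the result you asked for. Like case_when(), it resolves ambiguous names by order: "Lawn Construction" would come out as Landscaping because "lawn" is listed before "construct", so put the keywords you trust most first.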