I am trying to categorize messy data in a dataframe with four columns:
"company_name", which is the messy data I want to categorize
"categories", which is the list of categories I want to put the messy data in
"search", which are the keywords I want to search for in the messy data
"company_type", which will hold the correct company type for each row
"company_name"     "categories"  "search"   "company_type"
John landscaping   Landscaping   lawn       NA
constructing pros  Landscaping   landscape  NA
Brother Lawn care  Cleaning      clean      NA
Top cleaning       Painting      paint      NA
Emmas paints       Construction  construct  NA
I want my final result to look like this:
"company_name"     "categories"  "search"   "company_type"
John landscaping   Landscaping   lawn       Landscaping
constructing pros  Landscaping   landscape  Construction
Brother Lawn care  Cleaning      clean      Landscaping
Top cleaning       Painting      paint      Cleaning
Emmas paints       Construction  construct  Painting
Does anyone know how I should proceed?
Thank you!
CodePudding user response:
This answer is likely to be incomplete because it's not clear exactly how your company_type is decided. My best guess is that it is based on whether company_name contains related terms. This will definitely have to change depending on the data you put in, and I imagine some company names will be ambiguous, for example one called "Lawn Construction". Still, with the simple data frame given, I would just use a case_when() statement from dplyr, like so:
library(dplyr)

# Rebuild the example data frame from the question
dat <- data.frame(
  "company_name" = c("John landscaping", "constructing pros", "Brother Lawn care", "Top cleaning", "Emmas paints"),
  "categories" = c("Landscaping", "Landscaping", "Cleaning", "Painting", "Construction"),
  "search" = c("lawn", "landscape", "clean", "paint", "construct"),
  "company_type" = rep(NA, 5)
)

# grepl() is case-sensitive by default, so each pattern matches the names as written.
# case_when() uses the first condition that is TRUE; rows matching none stay NA.
dat %>%
  mutate(company_type = case_when(
    grepl("landscaping|Lawn", company_name) ~ "Landscaping",
    grepl("constructing", company_name) ~ "Construction",
    grepl("cleaning", company_name) ~ "Cleaning",
    grepl("paints", company_name) ~ "Painting"
  ))
Output:
       company_name   categories    search company_type
1  John landscaping  Landscaping      lawn  Landscaping
2 constructing pros  Landscaping landscape Construction
3 Brother Lawn care     Cleaning     clean  Landscaping
4      Top cleaning     Painting     paint     Cleaning
5      Emmas paints Construction construct     Painting
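If the keyword list grows, hard-coding every pattern gets tedious, so here is a rough sketch of a lookup-driven variant in base R. It rests on a few assumptions that are not in the question: the keyword-to-category pairs live in a separate table (called lookup here, a made-up name), matching ignores case, and the keyword "landscape" is shortened to the stem "landscap", because "landscaping" does not actually contain "landscape". classify_company is a hypothetical helper, not part of any package.

```r
# Hypothetical lookup table: each keyword stem maps to one category.
# "landscap" is an assumed stem so that it matches "landscaping".
lookup <- data.frame(
  search   = c("lawn", "landscap", "clean", "paint", "construct"),
  category = c("Landscaping", "Landscaping", "Cleaning", "Painting", "Construction"),
  stringsAsFactors = FALSE
)

company_name <- c("John landscaping", "constructing pros", "Brother Lawn care",
                  "Top cleaning", "Emmas paints")

# Return the category of the first keyword found in a name, or NA if none match.
classify_company <- function(name, lookup) {
  hits <- vapply(lookup$search,
                 function(kw) grepl(kw, name, ignore.case = TRUE),
                 logical(1))
  if (any(hits)) lookup$category[which(hits)[1]] else NA_character_
}

company_type <- vapply(company_name, classify_company, character(1),
                       lookup = lookup, USE.NAMES = FALSE)

result <- data.frame(company_name, company_type, stringsAsFactors = FALSE)
print(result)
```

As with case_when(), the first matching keyword wins, so the row order of lookup decides what an ambiguous name like "Lawn Construction" ends up as. Names that match nothing stay NA, which makes them easy to find and review by hand.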