I have the following data:
EDIT:
df<- data.frame(
id = c(432, 324, 322, 331, 242,443,223 ),
desc1= c("metal","steels&iron","irons\\copper", "sports material", "leather material", "durable goods", "electronic store")
,
store_names = c("ik bros","steel idrs", "kb materials", "ca pty (ltd)", "bkk stores", "k/k \\shop", "h/j & jj")
,
class = c("", "unknown","", "sports", "unknown", "unknown", "")
)
I want to search keywords from both desc1 and desc2 and assign a string value to class column. For example, keywords can be
indus_1 <- c("iron", "steel")
goods_store_1 <- c("goods", "store", "stores")
electr_1 <- c("electronic", "chips", "semiconductor")
unlabelled_1 <- c("leather")
here variable names indus_1, sports_1 and so on will be used to assign a string value to class. For instance, if "metal" keyword is found I assign indus after stripping away "_1" to class. In my approach, I am finding index of rows where keywords found and copying them to the copy of same dataframe, but this take quite long for a larger dataset, and may miss few classes as I am using \\b
to find exact match. Here is the expected output:
id desc1 store_names class
432 metal ik bros
324 steels&iron steel idrs indus
322 irons\\copper kb materials indus
331 sports material ca pty (ltd) sports
242 leather material bkk stores unlabelled
443 durable goods k/k \\shop goods_store
223 electronic store h/j & jj electr
I am looking for a more efficient method to do the same, a fully dplyr version would be preferable. Thanks for suggestions.
CodePudding user response:
Not sure if I've interpretted your question correctly; is this what you want to do?
library(tidyverse)
df<- data.frame(
id = c(432, 324, 322, 331, 242 ),
desc1 = c("iron and metal","sports material", "leather material", "durable goods", "electronic goods")
,
desc2 = c("ik bros", "ca pty (ltd)", "bkk stores", "k/k \\shop", "h/j & jj")
,
class = c("", "sports", "unknown", "unknown", "")
)
df2 <- df %>%
mutate(class = case_when(str_detect(desc1, "metal") | str_detect(desc2, "metal") ~ "indus",
str_detect(desc1, "sports") | str_detect(desc2, "sports") ~ "sports",
str_detect(desc1, "electronic") | str_detect(desc2, "electronic") ~ "electr",
str_detect(desc1, "goods") | str_detect(desc2, "goods") ~ "goods_store",
str_detect(desc1, "leather") | str_detect(desc2, "leather") ~ "unlabelled"))
df2
#> id desc1 desc2 class
#> 1 432 iron and metal ik bros indus
#> 2 324 sports material ca pty (ltd) sports
#> 3 322 leather material bkk stores unlabelled
#> 4 331 durable goods k/k \\shop goods_store
#> 5 242 electronic goods h/j & jj electr
Created on 2021-10-25 by the reprex package (v2.0.1)
CodePudding user response:
In that case, you could do:
vars_1 <- mget(ls(pattern = '_1'))
vars_1 <- vars_1[!grepl('vars', names(vars_1))]
pat <- sub("_1", "", names(vars_1))
names(pat) <- sprintf(".*(%s).*", unlist(vars_1))
df %>%
mutate(class = str_replace_all(invoke(str_c, across(starts_with('desc'))), pat))
id desc1 desc2 class
1 432 iron and metal ik bros indus
2 324 sports material ca pty (ltd) sports
3 322 leather material bkk stores unlabelled
4 331 durable goods k/k \\shop goods_store
5 242 electronic goods h/j & jj electr
CodePudding user response:
Logically my answer is similar to @Onyambu 's answer but with few tweaks.
library(tidyverse)
mget(ls(pattern = '_1')) %>%
stack() %>%
group_by(ind = sub('_1', '', ind)) %>%
summarise(values = sprintf('.*\\b(%s)\\b.*', paste0(values, collapse = '|'))) %>%
select(2, 1) %>%
deframe() -> pat
pat
#.*\\b(electronic|chips|semiconductor)\\b.* .*\\b(goods|store|stores)\\b.*
# "electr" "goods_store"
# .*\\b(iron|steel)\\b.* .*\\b(leather)\\b.*
# "indus" "unlabelled"
df %>%
mutate(class2 = str_replace_all(desc1, pat),
class2 = ifelse(desc1 == class2, '', class2))
# id desc1 store_names class class2
#1 432 metal ik bros
#2 324 steels&iron steel idrs unknown indus
#3 322 irons\\copper kb materials
#4 331 sports material ca pty (ltd) sports
#5 242 leather material bkk stores unknown unlabelled
#6 443 durable goods k/k \\shop unknown goods_store
#7 223 electronic store h/j & jj electr
For id = 322
it doesn't match indus
because we are looking for an exact match. indus_1
has iron
whereas desc1
column has irons
.