Home > Software engineering >  Get row index to update another dataframe in loop
Get row index to update another dataframe in loop

Time:10-25

I have the following data:

EDIT:

 df<- data.frame(
  id = c(432, 324, 322, 331, 242,443,223 ), 
  
  desc1= c("metal","steels&iron","irons\\copper", "sports material", "leather material", "durable goods", "electronic store")
  ,
  store_names = c("ik bros","steel idrs", "kb materials",  "ca pty (ltd)",  "bkk stores", "k/k \\shop", "h/j & jj")
  ,
  class = c("", "unknown","",  "sports", "unknown", "unknown", "") 
)

I want to search keywords from both desc1 and desc2 and assign a string value to class column. For example, keywords can be

indus_1 <- c("iron", "steel")
goods_store_1 <- c("goods", "store", "stores")
electr_1 <- c("electronic", "chips", "semiconductor")
unlabelled_1 <- c("leather")

here variable names indus_1, sports_1 and so on will be used to assign a string value to class. For instance, if "metal" keyword is found I assign indus after stripping away "_1" to class. In my approach, I am finding index of rows where keywords found and copying them to the copy of same dataframe, but this take quite long for a larger dataset, and may miss few classes as I am using \\b to find exact match. Here is the expected output:

  id            desc1  store_names       class
  432            metal      ik bros            
  324      steels&iron   steel idrs       indus
  322    irons\\copper kb materials       indus     
  331  sports material ca pty (ltd)      sports
  242 leather material   bkk stores  unlabelled
  443    durable goods   k/k \\shop goods_store
  223 electronic store     h/j & jj      electr

I am looking for a more efficient method to do the same, a fully dplyr version would be preferable. Thanks for suggestions.

CodePudding user response:

Not sure if I've interpretted your question correctly; is this what you want to do?

library(tidyverse)

df<- data.frame(
  id = c(432, 324, 322, 331, 242 ), 
  
  desc1 = c("iron and metal","sports material", "leather material", "durable goods", "electronic goods")
  ,
  desc2 = c("ik bros", "ca pty (ltd)",  "bkk stores", "k/k \\shop", "h/j & jj")
  ,
  class = c("", "sports", "unknown", "unknown", "") 
)

df2 <- df %>%
  mutate(class = case_when(str_detect(desc1, "metal") | str_detect(desc2, "metal")  ~ "indus",
                           str_detect(desc1, "sports") | str_detect(desc2, "sports") ~ "sports",
                           str_detect(desc1, "electronic") | str_detect(desc2, "electronic") ~ "electr",
                           str_detect(desc1, "goods") | str_detect(desc2, "goods") ~ "goods_store",
                           str_detect(desc1, "leather") | str_detect(desc2, "leather") ~ "unlabelled"))

df2
#>    id            desc1        desc2       class
#> 1 432   iron and metal      ik bros       indus
#> 2 324  sports material ca pty (ltd)      sports
#> 3 322 leather material   bkk stores  unlabelled
#> 4 331    durable goods   k/k \\shop goods_store
#> 5 242 electronic goods     h/j & jj      electr

Created on 2021-10-25 by the reprex package (v2.0.1)

CodePudding user response:

In that case, you could do:

vars_1 <- mget(ls(pattern = '_1'))
vars_1 <- vars_1[!grepl('vars', names(vars_1))]
pat <- sub("_1", "", names(vars_1))
names(pat) <- sprintf(".*(%s).*", unlist(vars_1))

df %>%
  mutate(class = str_replace_all(invoke(str_c, across(starts_with('desc'))), pat))

  id            desc1        desc2       class
1 432   iron and metal      ik bros       indus
2 324  sports material ca pty (ltd)      sports
3 322 leather material   bkk stores  unlabelled
4 331    durable goods   k/k \\shop goods_store
5 242 electronic goods     h/j & jj      electr

CodePudding user response:

Logically my answer is similar to @Onyambu 's answer but with few tweaks.

library(tidyverse)

mget(ls(pattern = '_1')) %>%
  stack() %>%
  group_by(ind = sub('_1', '', ind)) %>%
  summarise(values = sprintf('.*\\b(%s)\\b.*', paste0(values, collapse = '|'))) %>%
  select(2, 1) %>%
  deframe() -> pat

pat
#.*\\b(electronic|chips|semiconductor)\\b.*   .*\\b(goods|store|stores)\\b.*
#                                  "electr"           "goods_store" 
#                   .*\\b(iron|steel)\\b.*         .*\\b(leather)\\b.*
#                                   "indus"             "unlabelled"
df %>%
  mutate(class2 = str_replace_all(desc1, pat), 
         class2 = ifelse(desc1 == class2, '', class2))

#   id            desc1  store_names   class      class2
#1 432            metal      ik bros                    
#2 324      steels&iron   steel idrs unknown       indus
#3 322    irons\\copper kb materials                    
#4 331  sports material ca pty (ltd)  sports            
#5 242 leather material   bkk stores unknown  unlabelled
#6 443    durable goods   k/k \\shop unknown goods_store
#7 223 electronic store     h/j & jj              electr

For id = 322 it doesn't match indus because we are looking for an exact match. indus_1 has iron whereas desc1 column has irons.

  •  Tags:  
  • r
  • Related