Wrapping a string extract function in an ifelse statement

Time:10-22

The question below is an extension of this question.

Example data

I have example data as follows:

library(data.table)
example_dat <- fread("var_nam description
      some_var this_is_some_var_kg
      other_var this_is_meters_for_another_var
      extra_var the_price_of_apples
      another_var cost_of_goods_sold")
example_dat$description  <- gsub("_", " ", example_dat$description)

       var_nam                    description
1:    some_var            this is some var kg
2:   other_var this is meters for another var
3:   extra_var            the price of apples
4: another_var             cost of goods sold

vector_of_units <- c("kg", "meters", "var")

Previous solutions

I first asked how to create a separate column in this data that picks up certain units listed in a vector (vector_of_units). One option is this answer by maydin, which returns all matches:

library(tidyverse)
setDT(example_dat)[, unit := unlist(lapply(example_dat$description, function(x)
  paste0(vector_of_units[str_detect(x, vector_of_units)], collapse = ",")))]

       var_nam                    description       unit
1:    some_var            this is some var kg     kg,var
2:   other_var this is meters for another var meters,var
3:   extra_var            the price of apples
4: another_var             cost of goods sold

I also found this answer by langtang, which returns only the first match (which is actually preferable in my situation):

example_dat[, unit:=stringr::str_extract(description, paste0(vector_of_units,collapse = "|"))]

       var_nam                    description   unit
1:    some_var            this is some var kg    var
2:   other_var this is meters for another var meters
3:   extra_var            the price of apples   <NA>
4: another_var             cost of goods sold   <NA>

Previous question: "Based on vector of strings extract string from data.table column into new column"

More flexibility with an ifelse statement

I would however like to have a little more flexibility.

Firstly, I would like to supply a vector of matches and a separate vector of replacements, so that I can change the hits into something else:

vector_of_units_in <- c("kg", "meters", "var")
vector_of_units_out <- c("kilogram", "meters", "variable")

vector_of_units_euro <- c("cost", "price")
vector_of_units_euro_out <- "euro"

Secondly, I would like to be able to choose what happens when there is no hit. For example, when applying the solution by langtang, I do not want rows without a match to have their existing value overwritten with NA.

I have been trying to mess around with langtang's solution:

setDT(example_dat)[, unit := ifelse(!is.na(stringr::str_extract(description, vector_of_units_in)), paste0(vector_of_units_out, collapse = "|"), NA)]

# NA has been replaced by unit, so that it is not overwritten in case of no match
setDT(example_dat)[, unit := ifelse(!is.na(stringr::str_extract(description, vector_of_units_euro)), paste0(vector_of_units_euro_out, collapse = "|"), unit)]

But I end up with this:

       var_nam                    description                     unit
1:    some_var            this is some var kg kilogram|meters|variable
2:   other_var this is meters for another var kilogram|meters|variable
3:   extra_var            the price of apples                     <NA>
4: another_var             cost of goods sold                     <NA>

How should I write this syntax?

Desired output

       var_nam                    description     unit
1:    some_var            this is some var kg kilogram
2:   other_var this is meters for another var   meters
3:   extra_var            the price of apples     euro
4: another_var             cost of goods sold     euro

CodePudding user response:

You could use a named vector of units, with the replacement as name and the pattern as value, and pass a vectorized grepl to outer. If no match is found for a row, return NA.

units <- c(kilogram="kg", meters="meters", euro="cost", euro="price", variable='var')

dat[, unit:=apply(outer(units, description, Vectorize(grepl)), 2, \(x) 
                  if (any(x)) names(which(x)) else NA)]
dat
#        var_nam                    description              unit
# 1:    some_var            this is some var kg kilogram,variable
# 2:    some_var                   this is some                NA
# 3:   other_var this is meters for another var   meters,variable
# 4:   extra_var            the price of apples              euro
# 5: another_var             cost of goods sold              euro
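If only a single label per row is wanted, as in the desired output, one option is to take the first hit in the order of the units vector. The following is a minimal base R sketch, not part of the code above: first_unit is a hypothetical helper, the last description is an added no-match row, and treating the order of units as a priority is an assumption that the question does not state. Patterns are matched as fixed strings.

```r
# Hypothetical helper: for each description, return the name of the first
# pattern in `units` (in the order given) that occurs in it, else NA.
first_unit <- function(description, units) {
  vapply(description, function(d) {
    # one TRUE/FALSE per pattern for this description
    hit <- vapply(units, function(p) grepl(p, d, fixed = TRUE), logical(1))
    if (any(hit)) names(units)[which(hit)[1]] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

units <- c(kilogram = "kg", meters = "meters", euro = "cost", euro = "price",
           variable = "var")

# The four descriptions from the question plus one added row without a match
description <- c("this is some var kg", "this is meters for another var",
                 "the price of apples", "cost of goods sold",
                 "nothing to match")

first_unit(description, units)
# "kilogram" "meters" "euro" "euro" NA
```

In a data.table this could be called as dat[, unit := first_unit(description, units)], since the helper returns one character value per row.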

Data:

library(data.table)
dat <- data.table(
  var_nam = c("some_var", "some_var", "other_var", "extra_var", "another_var"),
  description = c("this is some var kg", "this is some",
                  "this is meters for another var", "the price of apples",
                  "cost of goods sold"))

CodePudding user response:

Since you don't tag data.table explicitly, you may be open to a tidyverse solution as well:

First define a vector of units you want to match:

units_to_match <- c("kg", "meters", "cost", "price")

Then define a vector with the replacements:

units_to_replace <- setNames(c("kilogram", "meters", "euro", "euro"),
                             c("kg", "meters", "cost", "price"))

Now extract the first match using str_extract and then replace it using str_replace_all:

library(tidyverse)
example_dat %>%
  mutate(unit = str_extract(description, str_c(units_to_match, collapse = "|")),
         unit = str_replace_all(unit, units_to_replace))

       var_nam                    description     unit
1:    some_var            this is some var kg kilogram
2:   other_var this is meters for another var   meters
3:   extra_var            the price of apples     euro
4: another_var             cost of goods sold     euro
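The two requirements from the question (separate in/out vectors, and keeping the existing value when nothing matches) can also be handled step by step. In the attempt from the question, paste0(vector_of_units_out, collapse = "|") glues all labels into one string, which is why every hit shows kilogram|meters|variable. Below is a minimal base R sketch under some assumptions: add_units is a hypothetical helper name, patterns are matched as fixed strings, and earlier patterns are assumed to take priority over later ones.

```r
# Hypothetical helper: label each description with the output belonging to the
# first matching input pattern; rows without a match keep their `current` value.
add_units <- function(description, units_in, units_out, current) {
  # allow a single output label (e.g. "euro") for several input patterns
  units_out <- rep_len(units_out, length(units_in))
  found <- rep(NA_character_, length(description))
  for (i in seq_along(units_in)) {
    # only fill rows that no earlier pattern has already labelled
    hit <- is.na(found) & grepl(units_in[i], description, fixed = TRUE)
    found[hit] <- units_out[i]
  }
  # the ifelse step: fall back to the existing value when nothing matched
  ifelse(is.na(found), current, found)
}

description <- c("this is some var kg", "this is meters for another var",
                 "the price of apples", "cost of goods sold")

unit <- rep(NA_character_, length(description))
unit <- add_units(description, c("kg", "meters", "var"),
                  c("kilogram", "meters", "variable"), unit)
unit <- add_units(description, c("cost", "price"), "euro", unit)
unit
# "kilogram" "meters" "euro" "euro"
```

Because the second call passes the existing unit values as current, the labels from the first call are not overwritten with NA.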