How do I remove part of my character variable in R?-CodePudding

I have a dataset, where I've already created separate columns for cocktail ingredients, so one ingredient is shown in one column. Now I have variables like this:

ingredients <- c("1 1/2 oz Plymouth gin", "1 oz egg white", "3/4 oz lemon juice", "2 oz rye (50% abv)", "2 oz white rum (40% abv)", "3/4 oz lime juice", "3/4 oz honey syrup")

etc.

I'd need to clean it by removing all quantities (e.g. 1/2 oz, 2 dashes, etc.), and the indicator of alcohol content (e.g. 47.3% abv). I've tried to do it one by one (removing numbers, then removing "1/2" and "3/4", after removing the "oz", "dashes", "()" and the "%" and "abv"),

df %>%
mutate(ingredient1 = str_remove(ingredient1, "[[:digit:]] ")) %>%
  mutate(ingredient1 = str_remove(ingredient1, "oz"))

but it is a lot of work and I am quite sure that there is a more elegant and efficient solution.

I am looking for a solution where I can tell R to delete everything before and including "oz" or "dashes", and also remove everything that starts with "(".

CodePudding user response：

Here is a starting point how you could achieve your task:

library(dplyr)
library(stringr)
df %>% 
  mutate(across(everything(), ~sub(".*oz ", '', .))) %>%
  mutate(across(everything(), ~sub(".*OZ ", '', .))) %>% 
  mutate(across(everything(), ~str_replace(., " \\s*\\([^\\)] \\)", "")))

   ingredient1          ingredient2      ingredient3     
   <chr>                <chr>            <chr>           
 1 pisco                egg white        lime juice      
 2 Plymouth gin         egg white        lemon juice     
 3 Plymouth gin         egg white        Dolin dry vermo 
 4 rye                  simple syrup     lemon juice     
 5 white rum            lime juice       simple syrup    
 6 white rum            lime juice       honey syrup     
 7 white rum            lime juice       simple syrup    
 8 Scotch               Cherry Herring   sweet vermouth  
 9 Cognac               heavy cream      Demerara syrup  
10 white rum            lime juice       grapefruit juice
11 bourbon              grapefruit juice honey syrup     
12 Absolut Citron vodka Cointreau        cranberry juice 
13 bourbon              lemon juice      honey syrup

data:

structure(list(ingredient1 = c("2 oz pisco (40% abv)", "1 1/2 oz Plymouth gin", 
"2 oz Plymouth gin", "2 oz rye (50% abv)", "2 oz white rum (40% abv)", 
"2 oz white rum (40% abv)", "2 oz white rum (40% abv)", "1 oz Scotch (43% abv)", 
"2 oz Cognac (41% abv)", "2 oz white rum (40% abv)", "2 oz bourbon (45% abv)", 
"1 1/2 oz Absolut Citron vodka", "2 OZ bourbon (47% abv)"), ingredient2 = c("1 oz egg white", 
"1 oz egg white", "1 oz egg white", "3/4 oz simple syrup", "0.875 oz lime juice", 
"3/4 oz lime juice", "3/4 oz lime juice", "3/4 oz Cherry Herring", 
"1 oz heavy cream", "3/4 oz lime juice", "1 oz grapefruit juice", 
"3/4 oz Cointreau", "3/4 oz lemon juice"), ingredient3 = c("3/4 oz lime juice", 
"3/4 oz lemon juice", "1/2 oz Dolin dry vermo", "0.625 oz lemon juice", 
"3/4 oz simple syrup", "3/4 oz honey syrup", "3/4 oz simple syrup", 
"3/4 oz sweet vermouth", "1/4 oz Demerara syrup", "1/2 oz grapefruit juice", 
"1/2 oz honey syrup", "3/4 oz cranberry juice", "3/4 oz honey syrup"
)), row.names = c(NA, -13L), spec = structure(list(cols = list(
    ingredient1 = structure(list(), class = c("collector_character", 
    "collector")), ingredient2 = structure(list(), class = c("collector_character", 
    "collector")), ingredient3 = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = "\t"), class = "col_spec"), problems = <pointer: 0x00000179794ebf20>, class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

CodePudding user response：

you can use a look behind (or look ahead - I always get those confused :)

stringr::str_trim(gsub("(.*oz)?|\\(.*\\)","", ingredients))
#> [1] "Plymouth gin" "egg white"    "lemon juice"  "rye"          "white rum"   
#> [6] "lime juice"   "honey syrup"