I have a dataset, where I've already created separate columns for cocktail ingredients, so one ingredient is shown in one column. Now I have variables like this:
ingredients <- c("1 1/2 oz Plymouth gin", "1 oz egg white", "3/4 oz lemon juice", "2 oz rye (50% abv)", "2 oz white rum (40% abv)", "3/4 oz lime juice", "3/4 oz honey syrup")
etc.
I'd need to clean it by removing all quantities (e.g. 1/2 oz, 2 dashes, etc.), and the indicator of alcohol content (e.g. 47.3% abv). I've tried to do it one by one (removing numbers, then removing "1/2" and "3/4", after removing the "oz", "dashes", "()" and the "%" and "abv"),
df %>%
mutate(ingredient1 = str_remove(ingredient1, "[[:digit:]] ")) %>%
mutate(ingredient1 = str_remove(ingredient1, "oz"))
but it is a lot of work and I am quite sure that there is a more elegant and efficient solution.
I am looking for a solution where I can tell R to delete everything before and including "oz" or "dashes", and also remove everything that starts with "(".
CodePudding user response:
Here is a starting point how you could achieve your task:
library(dplyr)
library(stringr)
df %>%
mutate(across(everything(), ~sub(".*oz ", '', .))) %>%
mutate(across(everything(), ~sub(".*OZ ", '', .))) %>%
mutate(across(everything(), ~str_replace(., " \\s*\\([^\\)] \\)", "")))
ingredient1 ingredient2 ingredient3
<chr> <chr> <chr>
1 pisco egg white lime juice
2 Plymouth gin egg white lemon juice
3 Plymouth gin egg white Dolin dry vermo
4 rye simple syrup lemon juice
5 white rum lime juice simple syrup
6 white rum lime juice honey syrup
7 white rum lime juice simple syrup
8 Scotch Cherry Herring sweet vermouth
9 Cognac heavy cream Demerara syrup
10 white rum lime juice grapefruit juice
11 bourbon grapefruit juice honey syrup
12 Absolut Citron vodka Cointreau cranberry juice
13 bourbon lemon juice honey syrup
data:
structure(list(ingredient1 = c("2 oz pisco (40% abv)", "1 1/2 oz Plymouth gin",
"2 oz Plymouth gin", "2 oz rye (50% abv)", "2 oz white rum (40% abv)",
"2 oz white rum (40% abv)", "2 oz white rum (40% abv)", "1 oz Scotch (43% abv)",
"2 oz Cognac (41% abv)", "2 oz white rum (40% abv)", "2 oz bourbon (45% abv)",
"1 1/2 oz Absolut Citron vodka", "2 OZ bourbon (47% abv)"), ingredient2 = c("1 oz egg white",
"1 oz egg white", "1 oz egg white", "3/4 oz simple syrup", "0.875 oz lime juice",
"3/4 oz lime juice", "3/4 oz lime juice", "3/4 oz Cherry Herring",
"1 oz heavy cream", "3/4 oz lime juice", "1 oz grapefruit juice",
"3/4 oz Cointreau", "3/4 oz lemon juice"), ingredient3 = c("3/4 oz lime juice",
"3/4 oz lemon juice", "1/2 oz Dolin dry vermo", "0.625 oz lemon juice",
"3/4 oz simple syrup", "3/4 oz honey syrup", "3/4 oz simple syrup",
"3/4 oz sweet vermouth", "1/4 oz Demerara syrup", "1/2 oz grapefruit juice",
"1/2 oz honey syrup", "3/4 oz cranberry juice", "3/4 oz honey syrup"
)), row.names = c(NA, -13L), spec = structure(list(cols = list(
ingredient1 = structure(list(), class = c("collector_character",
"collector")), ingredient2 = structure(list(), class = c("collector_character",
"collector")), ingredient3 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = "\t"), class = "col_spec"), problems = <pointer: 0x00000179794ebf20>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
CodePudding user response:
you can use a look behind (or look ahead - I always get those confused :)
stringr::str_trim(gsub("(.*oz)?|\\(.*\\)","", ingredients))
#> [1] "Plymouth gin" "egg white" "lemon juice" "rye" "white rum"
#> [6] "lime juice" "honey syrup"