Home > Software engineering >  Mutate multiple comma delimited variable
Mutate multiple comma delimited variable

Time:11-10

Within my dataframe I have a column with allergens per product. The values within the column 'allergens' are a wide range (milk, peanuts, soja, eggs etc.) and also mentioned in different languages.

Example df:

ID    Product         Allergens
1     cow milk        Milk, Milch, Lait
2     Almond milk     Milk, Milk, Almond, Nuts
3     Soja milk       Soja, soja, lait, Lait, Milch
4     Fried Cheese    WHEAT, MILK, MILK, MILK, WHEAT, WHEAT, MILK, CHEESE, MILK, CHEESE, CHEESE, MILK, CHEESE, CHEESE, MILK, CHEESE, WHEAT, BUTTER, CREAM

As in the above df the allergens are not always all lowercase or uppercase, for example: Lait and lait.

What would be the most efficient way (the dataset contains almost 250k observations) to mutate all this data and remove duplicates (including the foreign words).

ID    Product         Allergens
3     Soja milk       Soja, Milk
4     Fried Cheese    Wheat, Milk, Cheese, Butter, Cream

CodePudding user response:

library(tidyverse)

data <- tribble(
  ~ID, ~Product, ~Allergens,
  1, "cow milk", "Milk, Milch, Lait",
  2, "Almond milk", "Milk, Milk, Almond, Nuts",
  3, "Soja milk", "Soja, soja, lait, Lait, Milch",
  4, "Fried Cheese", "WHEAT, MILK, MILK, MILK, WHEAT, WHEAT, MILK, CHEESE, MILK, CHEESE, CHEESE, MILK, CHEESE, CHEESE, MILK, CHEESE, WHEAT, BUTTER, CREAM"
)
data
#> # A tibble: 4 x 3
#>      ID Product     Allergens                                                   
#>   <dbl> <chr>       <chr>                                                       
#> 1     1 cow milk    Milk, Milch, Lait                                           
#> 2     2 Almond milk Milk, Milk, Almond, Nuts                                    
#> 3     3 Soja milk   Soja, soja, lait, Lait, Milch                               
#> 4     4 Fried Chee… WHEAT, MILK, MILK, MILK, WHEAT, WHEAT, MILK, CHEESE, MILK, …

unify <- function(x) {
  x %>% recode(
    "milch" = "milk",
    "lait" = "milk"
  )
}

data %>%
  mutate(Allergens = Allergens %>% str_split(",")) %>%
  unnest(Allergens) %>%
  mutate(
    Allergens = Allergens %>% map_chr(~ .x %>%
      str_trim() %>%
      tolower() %>%
      unique() %>%
      unify())
  ) %>%
  group_by(ID, Product) %>%
  nest() %>%
  mutate(
    data = data %>% map_chr(~ .x %>%
      pull(Allergens) %>%
      unique() %>%
      paste(collapse = ", "))
  )
#> # A tibble: 4 x 3
#> # Groups:   ID, Product [4]
#>      ID Product      data                              
#>   <dbl> <chr>        <chr>                             
#> 1     1 cow milk     milk                              
#> 2     2 Almond milk  milk, almond, nuts                
#> 3     3 Soja milk    soja, milk                        
#> 4     4 Fried Cheese wheat, milk, cheese, butter, cream

Created on 2021-11-09 by the reprex package (v2.0.1)

  • Related