Within my dataframe I have a column with allergens per product. The values within the column 'allergens' are a wide range (milk, peanuts, soja, eggs etc.) and also mentioned in different languages.
Example df:
ID Product Allergens
1 cow milk Milk, Milch, Lait
2 Almond milk Milk, Milk, Almond, Nuts
3 Soja milk Soja, soja, lait, Lait, Milch
4 Fried Cheese WHEAT, MILK, MILK, MILK, WHEAT, WHEAT, MILK, CHEESE, MILK, CHEESE, CHEESE, MILK, CHEESE, CHEESE, MILK, CHEESE, WHEAT, BUTTER, CREAM
As in the above df
the allergens are not always all lowercase or uppercase, for example: Lait
and lait
.
What would be the most efficient way (the dataset contains almost 250k observations) to mutate all this data and remove duplicates (including the foreign words).
ID Product Allergens
3 Soja milk Soja, Milk
4 Fried Cheese Wheat, Milk, Cheese, Butter, Cream
CodePudding user response:
library(tidyverse)
data <- tribble(
~ID, ~Product, ~Allergens,
1, "cow milk", "Milk, Milch, Lait",
2, "Almond milk", "Milk, Milk, Almond, Nuts",
3, "Soja milk", "Soja, soja, lait, Lait, Milch",
4, "Fried Cheese", "WHEAT, MILK, MILK, MILK, WHEAT, WHEAT, MILK, CHEESE, MILK, CHEESE, CHEESE, MILK, CHEESE, CHEESE, MILK, CHEESE, WHEAT, BUTTER, CREAM"
)
data
#> # A tibble: 4 x 3
#> ID Product Allergens
#> <dbl> <chr> <chr>
#> 1 1 cow milk Milk, Milch, Lait
#> 2 2 Almond milk Milk, Milk, Almond, Nuts
#> 3 3 Soja milk Soja, soja, lait, Lait, Milch
#> 4 4 Fried Chee… WHEAT, MILK, MILK, MILK, WHEAT, WHEAT, MILK, CHEESE, MILK, …
unify <- function(x) {
x %>% recode(
"milch" = "milk",
"lait" = "milk"
)
}
data %>%
mutate(Allergens = Allergens %>% str_split(",")) %>%
unnest(Allergens) %>%
mutate(
Allergens = Allergens %>% map_chr(~ .x %>%
str_trim() %>%
tolower() %>%
unique() %>%
unify())
) %>%
group_by(ID, Product) %>%
nest() %>%
mutate(
data = data %>% map_chr(~ .x %>%
pull(Allergens) %>%
unique() %>%
paste(collapse = ", "))
)
#> # A tibble: 4 x 3
#> # Groups: ID, Product [4]
#> ID Product data
#> <dbl> <chr> <chr>
#> 1 1 cow milk milk
#> 2 2 Almond milk milk, almond, nuts
#> 3 3 Soja milk soja, milk
#> 4 4 Fried Cheese wheat, milk, cheese, butter, cream
Created on 2021-11-09 by the reprex package (v2.0.1)