I have a data frame that contains a text string like the one below, showing the ingredients and the proportion of each ingredient. What I would like to achieve is to extract the proportion of each ingredient as a separate variable:
What I have:
given <- tibble(
  ingredients = c("1.5BZ 1FZ 2HT", "2FZ", "0.5HT 2BZ")
)
What I want to achieve:
to_achieve <- tibble(
  ingredients = c("1.5BZ 1FZ 2HT", "2FZ", "0.5HT 2BZ"),
  proportion_bz = c(1.5, 0, 2),
  proportion_fz = c(1, 2, 0),
  proportion_ht = c(2, 0, 0.5)
)
Please note there might be more than a dozen different ingredients and tidyverse methods are preferred.
Thanks in advance, Felix
CodePudding user response:
Making heavy use of tidyr, you could first split your strings into rows per ingredient using separate_rows, afterwards extract the numeric proportion and the type of ingredient, and finally use pivot_wider to reshape into your desired format:
library(dplyr)
library(tidyr)

given |>
  # keep the original string and split a copy into one row per ingredient
  mutate(ingredients_split = ingredients) |>
  tidyr::separate_rows(ingredients_split, sep = "\\ ") |>
  # group 1 is the leading number, group 2 is the ingredient code
  tidyr::extract(
    ingredients_split,
    into = c("proportion", "ingredient"),
    regex = "^([\\d\\.]+)(.*)$"
  ) |>
  mutate(
    proportion = as.numeric(proportion),
    ingredient = tolower(ingredient)
  ) |>
  # one column per ingredient; ingredients missing from a row become 0
  pivot_wider(
    names_from = ingredient,
    names_prefix = "proportion_",
    values_from = proportion,
    values_fill = 0
  )
#> # A tibble: 3 × 4
#>   ingredients   proportion_bz proportion_fz proportion_ht
#>   <chr>                 <dbl>         <dbl>         <dbl>
#> 1 1.5BZ 1FZ 2HT           1.5             1           2
#> 2 2FZ                     0               2           0
#> 3 0.5HT 2BZ               2               0           0.5
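To see what the regular expression passed to extract does, here is a minimal sketch that applies the same pattern to a few stand-alone tokens with stringr::str_match. The tokens vector and the variable names are made up for illustration; only the pattern is taken from the pipeline above.

```r
library(stringr)

# Hypothetical tokens, as they would look after separate_rows()
tokens <- c("1.5BZ", "1FZ", "0.5HT")

# str_match() returns a character matrix:
# column 1 = full match, column 2 = first group, column 3 = second group
parts <- str_match(tokens, "^([\\d\\.]+)(.*)$")

proportion <- as.numeric(parts[, 2])  # leading digits and dots
ingredient <- tolower(parts[, 3])     # everything after the number

proportion
ingredient
```

The first group takes one or more digits or dots from the start of the token, and the second group takes whatever is left, so the pattern does not depend on how long the ingredient code is.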
CodePudding user response:
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(janitor)
# SOLUTION -----
given %>%
  # note: into = c("a", "b", "c") handles at most three ingredients per row
  separate(ingredients, into = c("a", "b", "c"), sep = "\\ ",
           remove = FALSE, fill = "right") %>%
  pivot_longer(a:c) %>%
  select(-name) %>%
  # strip digits and dots to get the ingredient code, then parse the number
  mutate(name = str_remove_all(value, "[0-9]|\\."),
         value = parse_number(value)) %>%
  na.omit() %>%
  pivot_wider(names_prefix = "proportion_", values_fill = 0) %>%
  clean_names()
# OUTPUT ----
#> # A tibble: 3 × 4
#>   ingredients   proportion_bz proportion_fz proportion_ht
#>   <chr>                 <dbl>         <dbl>         <dbl>
#> 1 1.5BZ 1FZ 2HT           1.5             1           2
#> 2 2FZ                     0               2           0
#> 3 0.5HT 2BZ               2               0           0.5
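Since the question mentions there may be more than a dozen ingredients, the fixed a, b, c columns in this answer could become a limit. Below is a rough sketch of the same idea (parse_number plus str_remove_all) that splits into rows instead, so the number of ingredients per string does not need to be known in advance. The helper column token and the object result are names I made up, and the sketch assumes every ingredients string is unique; with repeated strings you would probably want to add a row id before pivoting.

```r
library(dplyr)
library(tidyr)
library(readr)
library(stringr)

given <- tibble(ingredients = c("1.5BZ 1FZ 2HT", "2FZ", "0.5HT 2BZ"))

result <- given %>%
  # copy the string so the original column survives the split
  mutate(token = ingredients) %>%
  # one row per ingredient, however many there are
  separate_rows(token, sep = " ") %>%
  mutate(
    name  = str_to_lower(str_remove_all(token, "[0-9.]")),  # ingredient code
    value = parse_number(token)                             # proportion
  ) %>%
  select(-token) %>%
  pivot_wider(
    names_from = name,
    values_from = value,
    names_prefix = "proportion_",
    values_fill = 0
  )

result
```

Lowercasing the code inside mutate means clean_names is not needed here, which also drops the janitor dependency.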