Extract digits from strings in R

I have a data frame that contains a text string like the one below, listing the ingredients and the proportion of each ingredient. What I would like to achieve is to extract the proportion of each ingredient as a separate variable.

What I have:

given <- tibble(
  ingredients =c("1.5BZ 1FZ 2HT","2FZ","0.5HT 2BZ")
)

What I want to achieve:

to_achieve <- tibble(
  ingredients =c("1.5BZ 1FZ 2HT","2FZ","0.5HT 2BZ"),
  proportion_bz = c(1.5,0,2),
  proportion_fz = c(1,2,0),
  proportion_ht = c(2,0,0.5)
)

Please note there might be more than a dozen different ingredients and tidyverse methods are preferred.

Thanks in advance, Felix

CodePudding user response：

Making heavy use of tidyr, you could first split your strings into one row per ingredient using separate_rows, then extract the numeric proportion and the ingredient type with extract, and finally use pivot_wider to reshape into your desired format:

library(dplyr)
library(tidyr)

given |>
  mutate(ingredients_split = ingredients) |>
  tidyr::separate_rows(ingredients_split, sep = "\\ ") |>
  tidyr::extract(
    ingredients_split,
    into = c("proportion", "ingredient"),
    regex = "^([\\d\\.]+)(.*)$"
  ) |>
  mutate(
    proportion = as.numeric(proportion),
    ingredient = tolower(ingredient)
  ) |>
  pivot_wider(
    names_from = ingredient,
    names_prefix = "proportion_",
    values_from = proportion,
    values_fill = 0
  )
#> # A tibble: 3 × 4
#>   ingredients   proportion_bz proportion_fz proportion_ht
#>   <chr>                 <dbl>         <dbl>         <dbl>
#> 1 1.5BZ 1FZ 2HT           1.5             1           2  
#> 2 2FZ                     0               2           0  
#> 3 0.5HT 2BZ               2               0           0.5
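
To see what the extraction step does to a single token, here is a minimal base R sketch. It assumes every token is a number (digits and dots) followed directly by the ingredient code; the names `tokens` and `pattern` are only illustrative and not part of the answer above.

```r
# Illustrative tokens, as produced by splitting the strings on spaces
tokens <- c("1.5BZ", "1FZ", "0.5HT")

# Group 1: leading run of digits and dots; group 2: everything after it
pattern <- "^([0-9.]+)(.*)$"

# sub() keeps only the requested capture group
proportion <- as.numeric(sub(pattern, "\\1", tokens))
ingredient <- tolower(sub(pattern, "\\2", tokens))

data.frame(ingredient, proportion)
```

tidyr::extract applies the same two-group idea, but writes each group into its own column.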

CodePudding user response：


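library(dplyr)   # select(), mutate()
library(tidyr)
library(readr)
library(stringr)
library(janitor)

# SOLUTION -----
given %>%
  # Split into helper columns; c("a", "b", "c") assumes at most three ingredients per string
  separate(ingredients, into = c("a", "b", "c"), sep = "\\ ", remove = FALSE, fill = "right") %>% 
  pivot_longer(a:c) %>% 
  select(-name) %>% 
  # Strip digits and dots to get the ingredient code, then parse the number
  mutate(name = str_remove_all(value, "[0-9]|\\."),
         value = parse_number(value)) %>% 
  na.omit() %>%   # drop the padding rows from strings with fewer ingredients
  pivot_wider(names_prefix = "proportion_", values_fill = 0) %>% 
  clean_names()   # lower-case the column names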

# OUTPUT ----
#> # A tibble: 3 × 4
#>   ingredients   proportion_bz proportion_fz proportion_ht
#>   <chr>                 <dbl>         <dbl>         <dbl>
#> 1 1.5BZ 1FZ 2HT           1.5             1           2  
#> 2 2FZ                     0               2           0  
#> 3 0.5HT 2BZ               2               0           0.5
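
Since the question mentions that there may be more than a dozen ingredients, the three helper columns named in separate() may not be enough. One possible variation, sketched below, derives the number of helper columns from the data. The names `n_max`, `part_cols` and `result` are hypothetical, and the sketch assumes tokens are separated by single spaces.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(readr)

given <- tibble(ingredients = c("1.5BZ 1FZ 2HT", "2FZ", "0.5HT 2BZ"))

# Largest number of space-separated tokens found in any row
n_max <- max(str_count(given$ingredients, "\\S+"))
part_cols <- paste0("part_", seq_len(n_max))  # hypothetical helper column names

result <- given %>%
  # Pad shorter rows with NA on the right instead of warning
  separate(ingredients, into = part_cols, sep = " ", remove = FALSE, fill = "right") %>%
  # One row per token; the NA padding is dropped here
  pivot_longer(all_of(part_cols), values_drop_na = TRUE) %>%
  mutate(
    name = str_to_lower(str_remove_all(value, "[0-9.]")),  # ingredient code
    value = parse_number(value)                            # numeric proportion
  ) %>%
  pivot_wider(names_prefix = "proportion_", values_fill = 0)

result
```

For the example data this keeps the same three proportion columns, but the helper columns now grow with the longest string rather than being fixed at three.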