Extract digits from strings in R

Time:12-14

I have a data frame that contains a text string like the one below, listing the ingredients and the proportion of each ingredient. What I would like to achieve is to extract the proportion of each ingredient as a separate variable:

What I have:

library(tibble)

given <- tibble(
  ingredients = c("1.5BZ 1FZ 2HT", "2FZ", "0.5HT 2BZ")
)

What I want to achieve:

to_achieve <- tibble(
  ingredients =c("1.5BZ 1FZ 2HT","2FZ","0.5HT 2BZ"),
  proportion_bz = c(1.5,0,2),
  proportion_fz = c(1,2,0),
  proportion_ht = c(2, 0, 0.5)
)

Please note that there might be more than a dozen different ingredients, and tidyverse methods are preferred.

Thanks in advance, Felix

CodePudding user response:

Making heavy use of tidyr, you could first split each string into one row per ingredient with separate_rows, then extract the numeric proportion and the ingredient type with extract, and finally use pivot_wider to reshape the result into your desired format:

library(dplyr)
library(tidyr)

given |>
  mutate(ingredients_split = ingredients) |>
  tidyr::separate_rows(ingredients_split, sep = "\\ ") |>
  tidyr::extract(
    ingredients_split,
    into = c("proportion", "ingredient"),
    regex = "^([\\d\\.]+)(.*)$"
  ) |>
  mutate(
    proportion = as.numeric(proportion),
    ingredient = tolower(ingredient)
  ) |>
  pivot_wider(
    names_from = ingredient,
    names_prefix = "proportion_",
    values_from = proportion,
    values_fill = 0
  )
#> # A tibble: 3 × 4
#>   ingredients   proportion_bz proportion_fz proportion_ht
#>   <chr>                 <dbl>         <dbl>         <dbl>
#> 1 1.5BZ 1FZ 2HT           1.5             1           2  
#> 2 2FZ                     0               2           0  
#> 3 0.5HT 2BZ               2               0           0.5
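
To see what the two capture groups in that regex pick up, here is a minimal sketch on individual tokens. It uses base R's sub instead of tidyr::extract, writes the digit class as [0-9.], and the names tokens and pattern are only illustrative:

```r
# Illustrative tokens, as they would look after splitting on spaces
tokens <- c("1.5BZ", "1FZ", "0.5HT")

# Group 1: leading run of digits and dots; group 2: everything after it
pattern <- "^([0-9.]+)(.*)$"

proportion <- as.numeric(sub(pattern, "\\1", tokens))
ingredient <- tolower(sub(pattern, "\\2", tokens))

proportion  # 1.5 1.0 0.5
ingredient  # "bz" "fz" "ht"
```

Without the + quantifier the first group would match only a single character, so a token such as "1.5BZ" would not be split correctly.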

CodePudding user response:


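library(dplyr)    # select(), mutate(), %>%
library(tidyr)    # separate(), pivot_longer(), pivot_wider()
library(readr)    # parse_number()
library(stringr)  # str_remove_all()
library(janitor)  # clean_names()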

# SOLUTION -----
given %>%
  separate(ingredients, into = c("a", "b", "c"), sep = "\\ ", remove = FALSE, fill = "right") %>%
  pivot_longer(a:c) %>% 
  select(-name) %>% 
  mutate(name = str_remove_all(value, "[0-9]|\\."),
         value = parse_number(value)) %>% 
  na.omit() %>% 
  pivot_wider(names_prefix = "proportion_", values_fill = 0) %>% 
  clean_names()

# OUTPUT ----
#> # A tibble: 3 × 4
#>   ingredients   proportion_bz proportion_fz proportion_ht
#>   <chr>                 <dbl>         <dbl>         <dbl>
#> 1 1.5BZ 1FZ 2HT           1.5             1           2  
#> 2 2FZ                     0               2           0  
#> 3 0.5HT 2BZ               2               0           0.5
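
The code above hard-codes three target columns (a, b, c), while the question mentions that there may be more than a dozen ingredients. One possible way to adapt it, as a sketch only, is to derive the number of columns from the data. This assumes ingredients are separated by exactly one space; n_max, slots and wide are illustrative names, and the number is pulled out with str_extract rather than parse_number:

```r
library(dplyr)
library(tidyr)
library(stringr)

given <- tibble(
  ingredients = c("1.5BZ 1FZ 2HT", "2FZ", "0.5HT 2BZ")
)

# Largest number of ingredients in any row (assumes single-space separators)
n_max <- max(str_count(given$ingredients, " ")) + 1
slots <- paste0("slot_", seq_len(n_max))

wide <- given %>%
  # One temporary column per possible ingredient, padded with NA on the right
  separate(ingredients, into = slots, sep = " ", remove = FALSE, fill = "right") %>%
  # Back to one row per ingredient, dropping the NA padding
  pivot_longer(all_of(slots), values_drop_na = TRUE) %>%
  select(-name) %>%
  mutate(
    name  = tolower(str_remove_all(value, "[0-9.]")),  # ingredient code
    value = as.numeric(str_extract(value, "[0-9.]+"))  # proportion
  ) %>%
  pivot_wider(names_prefix = "proportion_", values_fill = 0)

wide
```

The separate_rows approach from the first answer avoids the temporary columns altogether, so this variant is mainly useful if you want to stay close to the separate-based code.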

