Home > Software design >  Separate strings by a character in columns in r
Separate strings by a character in columns in r

Time:06-17

I have a column in a dataframe with scraped prices like this:

prices
$1,50 $1,20
$1,50
$1,75 $1,25 $1,35

In summary in each column I can have many prices. What I would like is to obtain different columns that are separated from $, this is what I need based on the example that I put.

prices               price1 price2 price3
$1,50 $1,20          1,50   1,20   NA
$1,50                1,50   NA     NA
$1,75 $1,25 $1,35    1,75   1,25   1,35

I have tried the following but neither option does what I need. Help

str_split(prices, pattern = '[$]') # I get a column with values like this c("", "1,50")
separate(prices, sep = '[$]', into = c("price1", "price2"), remove = FALSE) 
#Price1 is created empty and I am trying to use it in a function, 
#so in some dataframes the number of prices can vary.

CodePudding user response:

One approach using dplyr:

df %>% 
  rowwise() %>% 
  mutate(price = list(gsub("$", "",strsplit(prices, " ")[[1]],fixed = T))) %>% 
  unnest_wider(price,names_sep = "")

Output:

  prices            price1 price2 price3
  <chr>             <chr>  <chr>  <chr> 
1 $1,50 $1,20       1,50   1,20   NA    
2 $1,50             1,50   NA     NA    
3 $1,75 $1,25 $1,35 1,75   1,25   1,35 

Input:

df = structure(list(prices = c("$1,50 $1,20", "$1,50", "$1,75 $1,25 $1,35"
)), class = "data.frame", row.names = c(NA, -3L))

CodePudding user response:

If your default locale has comma as the decimal separator, then:

library(tidyverse)
options("readr.default_locale" = readr::locale(decimal_mark = ","))

df <- tibble(prices =
               c("$1,50 $1,20",
                 "$1,50",
                 "$1,75 $1,25 $1,35"))
df |>
  mutate(prices = prices |>
           str_split(" ") |>
           map( ~ str_remove(., "\\$"))) |>
  unnest_wider(prices) |>
  mutate(across(.fns = readr::parse_number))
#> New names:
#> New names:
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> # A tibble: 3 × 3
#>    ...1  ...2  ...3
#>   <dbl> <dbl> <dbl>
#> 1  1.5   1.2  NA   
#> 2  1.5  NA    NA   
#> 3  1.75  1.25  1.35

Otherwise:

df |>
  mutate(prices = prices |>
           str_split(" ") |>
           map( ~ str_remove(., "\\$"))) |>
  unnest_wider(prices) |>
  mutate(across(.fns = ~ readr::parse_number(., locale = readr::locale(decimal_mark = ","))))
#> New names:
#> New names:
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> # A tibble: 3 × 3
#>    ...1  ...2  ...3
#>   <dbl> <dbl> <dbl>
#> 1  1.5   1.2  NA   
#> 2  1.5  NA    NA   
#> 3  1.75  1.25  1.35

CodePudding user response:

With cSplit:

library(splitstackshape)
s <- cSplit(df, "prices", "$", type.convert = T)[, -1]
df[, paste0("price", 1:ncol(s))] <- s

#             prices  price1  price2  price3
#1       $1,50 $1,20    1,50    1,20    <NA>
#2             $1,50    1,50    <NA>    <NA>
#3 $1,75 $1,25 $1,35    1,75    1,25    1,35

CodePudding user response:

Add a prices_text column, separate it into rows, remove rows with "" entries, add a column n which counts the rows within prices_next to use as names and convert to wide form.

library(dplyr)
library(tidyr)

DF %>% 
  mutate(prices_text = prices) %>% 
  separate_rows(prices_text, sep = "[$ ] ") %>%
  filter(prices_text != "") %>%
  group_by(prices) %>%
  mutate(n = paste0("prices", 1:n())) %>%
  ungroup %>%
  pivot_wider(id_cols = prices, values_from = prices_text, names_from = n)

giving:

# A tibble: 3 × 4
  prices            prices1 prices2 prices3
  <chr>             <chr>   <chr>   <chr>  
1 $1,50 $1,20       1,50    1,20    <NA>   
2 $1,50             1,50    <NA>    <NA>   
3 $1,75 $1,25 $1,35 1,75    1,25    1,35   
  • Related