Home > OS >  Messy string column to wide format
Messy string column to wide format

Time:04-09

I have the following data set:

Input

date        string                                           value
2021-01-01  a=uk_b=goo1_c=brandA_d=phrase_d1 for pedro 2020    20
2021-02-01  a=us_b=goo2_c=brandB_d=phrase_d2 for peter 2020    30
2021-01-15  a=ca_b=goo2_c=brandC_e=102331                      40
2022-01-15  2                                                   0

I need to create a wide data frame based on values in string (see output below).I have hundreds of names, this is just a reproducible example.

Desired output

date        a   b     c      d                          e     value 2
2021-01-01  uk  goo1  brandA phrase_d1 for pedro 2020   NA     20  NA
2021-02-01  us  goo2  brandB phrase_d2 for peter 2020   NA     30  NA
2021-01-15  ca  goo2  brandC NA                        102331  40  NA
2022-01-15  NA  NA    NA     NA                         NA      0  NA

What would be a neat solution? I'm thinking of a combination of reshape and sub probably will take care of it.

Data

data = data.frame(date =c("2021-01-01","2021-02-01","2021-01-15","2022-01-15"),
                  string = c("a=uk_b=goo1_c=brandA_d=phrase_d1 for pedro 2020",
                             "a=us_b=goo2_c=brandB_d=phrase_d2 for peter 2020",
                             "a=ca_b=goo2_c=brandC_e=102331",2),
                  value = c(20,30,40,0))

CodePudding user response:

Another possible solution:

library(tidyverse)

data %>% 
  separate_rows(string, sep="_(?!d\\d)") %>% 
  separate(string, into=c("n1", "n2"), sep = "=", fill = "right") %>% 
  pivot_wider(id_cols = c(date, value), names_from = n1, values_from = n2)

#> # A tibble: 3 × 7
#>   date       value a     b     c      d                        e     
#>   <chr>      <dbl> <chr> <chr> <chr>  <chr>                    <chr> 
#> 1 2021-01-01    20 uk    goo1  brandA phrase_d1 for pedro 2020 <NA>  
#> 2 2021-02-01    30 us    goo2  brandB phrase_d2 for peter 2020 <NA>  
#> 3 2021-01-15    40 ca    goo2  brandC <NA>                     102331

CodePudding user response:

@PaulS's solution is more succinct than mine, but requires that the only underscores in the strings that are to be printed in the variables have a d and then a number behind them. If there are underscores that have other unknown patterns following them, the solution will break. Here's a quick example:

dat <- tibble::tribble(
  ~date,        ~string,                                           ~value,
"2021-01-01",  "abc=uk_def=goo1_ghi=brandA_jkl=phrase_dx for pedro 2020", 20,
"2021-02-01",  "abc=us_def=goo2_ghi=brandB_jkl=phrase_d2 for peter 2020", 30,
"2021-01-015", "abc=ca_def=goo2_ghi=brandC_mno=102331", 40)

library(stringr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)


dat %>% 
  separate_rows(string, sep="_(?!d\\d)") %>% 
  separate(string, into=c("n1", "n2"), sep = "=", fill = "right") %>% 
  pivot_wider(id_cols = c(date, value), names_from = n1, values_from = n2)
#> # A tibble: 3 × 8
#>   date        value abc   def   ghi    jkl                `dx for pedro …` mno  
#>   <chr>       <dbl> <chr> <chr> <chr>  <chr>              <chr>            <chr>
#> 1 2021-01-01     20 uk    goo1  brandA phrase             <NA>             <NA> 
#> 2 2021-02-01     30 us    goo2  brandB phrase_d2 for pet… <NA>             <NA> 
#> 3 2021-01-015    40 ca    goo2  brandC <NA>               <NA>             1023…

My solution is a bit more complicated, but I think it works in a wider variety of cases:

make_df <- function(string, labs=letters[1:5]){
  str <- str_split(string, "=", simplify=TRUE)
  if(length(str) == 1){
    nm <- str[1]
    str <- list(NA)
    names(str) <- nm 
  }
  if(length(str) > 1){
    nm <- c(str[1], gsub(".*_(.*?)$", "\\1", str[2:(length(str)-1)]))
    str <- str[-1]
    str <- gsub(paste0("_", nm, collapse="|"), "", str)
    str <- as.list(str)
    names(str) <- nm
  }
  do.call(data.frame, str)
}

dat %>% 
  rowwise() %>% 
  mutate(out = make_df(string)) %>% 
  unnest(out) %>% 
  select(-string)
#> # A tibble: 4 × 8
#>   date        value abc   def   ghi    jkl                      mno    X2   
#>   <chr>       <dbl> <chr> <chr> <chr>  <chr>                    <chr>  <lgl>
#> 1 2021-01-01     20 uk    goo1  brandA phrase_dx for pedro 2020 <NA>   NA   
#> 2 2021-02-01     30 us    goo2  brandB phrase_d2 for peter 2020 <NA>   NA   
#> 3 2021-01-015    40 ca    goo2  brandC <NA>                     102331 NA   
#> 4 2021-91-15      0 <NA>  <NA>  <NA>   <NA>                     <NA>   NA

Created on 2022-04-08 by the reprex package (v2.0.1)

If the strings with underscores in them are as regular as in the example, the @PaulS's solution is better. Otherwise, this one might work.

  • Related