I have the following data set:
Input
date string value
2021-01-01 a=uk_b=goo1_c=brandA_d=phrase_d1 for pedro 2020 20
2021-02-01 a=us_b=goo2_c=brandB_d=phrase_d2 for peter 2020 30
2021-01-15 a=ca_b=goo2_c=brandC_e=102331 40
2022-01-15 2 0
I need to create a wide data frame based on values in string
(see output below).I have hundreds of names, this is just a reproducible example.
Desired output
date a b c d e value 2
2021-01-01 uk goo1 brandA phrase_d1 for pedro 2020 NA 20 NA
2021-02-01 us goo2 brandB phrase_d2 for peter 2020 NA 30 NA
2021-01-15 ca goo2 brandC NA 102331 40 NA
2022-01-15 NA NA NA NA NA 0 NA
What would be a neat solution? I'm thinking of a combination of reshape and sub probably will take care of it.
Data
data = data.frame(date =c("2021-01-01","2021-02-01","2021-01-15","2022-01-15"),
string = c("a=uk_b=goo1_c=brandA_d=phrase_d1 for pedro 2020",
"a=us_b=goo2_c=brandB_d=phrase_d2 for peter 2020",
"a=ca_b=goo2_c=brandC_e=102331",2),
value = c(20,30,40,0))
CodePudding user response:
Another possible solution:
library(tidyverse)
data %>%
separate_rows(string, sep="_(?!d\\d)") %>%
separate(string, into=c("n1", "n2"), sep = "=", fill = "right") %>%
pivot_wider(id_cols = c(date, value), names_from = n1, values_from = n2)
#> # A tibble: 3 × 7
#> date value a b c d e
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 2021-01-01 20 uk goo1 brandA phrase_d1 for pedro 2020 <NA>
#> 2 2021-02-01 30 us goo2 brandB phrase_d2 for peter 2020 <NA>
#> 3 2021-01-15 40 ca goo2 brandC <NA> 102331
CodePudding user response:
@PaulS's solution is more succinct than mine, but requires that the only underscores in the strings that are to be printed in the variables have a d
and then a number behind them. If there are underscores that have other unknown patterns following them, the solution will break. Here's a quick example:
dat <- tibble::tribble(
~date, ~string, ~value,
"2021-01-01", "abc=uk_def=goo1_ghi=brandA_jkl=phrase_dx for pedro 2020", 20,
"2021-02-01", "abc=us_def=goo2_ghi=brandB_jkl=phrase_d2 for peter 2020", 30,
"2021-01-015", "abc=ca_def=goo2_ghi=brandC_mno=102331", 40)
library(stringr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
dat %>%
separate_rows(string, sep="_(?!d\\d)") %>%
separate(string, into=c("n1", "n2"), sep = "=", fill = "right") %>%
pivot_wider(id_cols = c(date, value), names_from = n1, values_from = n2)
#> # A tibble: 3 × 8
#> date value abc def ghi jkl `dx for pedro …` mno
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2021-01-01 20 uk goo1 brandA phrase <NA> <NA>
#> 2 2021-02-01 30 us goo2 brandB phrase_d2 for pet… <NA> <NA>
#> 3 2021-01-015 40 ca goo2 brandC <NA> <NA> 1023…
My solution is a bit more complicated, but I think it works in a wider variety of cases:
make_df <- function(string, labs=letters[1:5]){
str <- str_split(string, "=", simplify=TRUE)
if(length(str) == 1){
nm <- str[1]
str <- list(NA)
names(str) <- nm
}
if(length(str) > 1){
nm <- c(str[1], gsub(".*_(.*?)$", "\\1", str[2:(length(str)-1)]))
str <- str[-1]
str <- gsub(paste0("_", nm, collapse="|"), "", str)
str <- as.list(str)
names(str) <- nm
}
do.call(data.frame, str)
}
dat %>%
rowwise() %>%
mutate(out = make_df(string)) %>%
unnest(out) %>%
select(-string)
#> # A tibble: 4 × 8
#> date value abc def ghi jkl mno X2
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <lgl>
#> 1 2021-01-01 20 uk goo1 brandA phrase_dx for pedro 2020 <NA> NA
#> 2 2021-02-01 30 us goo2 brandB phrase_d2 for peter 2020 <NA> NA
#> 3 2021-01-015 40 ca goo2 brandC <NA> 102331 NA
#> 4 2021-91-15 0 <NA> <NA> <NA> <NA> <NA> NA
Created on 2022-04-08 by the reprex package (v2.0.1)
If the strings with underscores in them are as regular as in the example, the @PaulS's solution is better. Otherwise, this one might work.