Home > Software engineering >  Transforming a `data.frame` using a character variable containing a string of data
Transforming a `data.frame` using a character variable containing a string of data

Time:10-08

I have a data.frame that contains a character variable, which has a string of additional metadata (sort of a key-value format) that I'd like to have as variables in a data.frame; some of these metadata have multiple values (an array) of different length, and not all observations have all of the additional data, and would therefore need to be empty or NA

A representative sample:

dat <- data.frame(title = c("How To", "Why To", "When To"),
                  id = c("001", "005", "102"),
                  tags = c("Type: Article, Length: 3.5, Topics: self help, DIY",
                           "Type: Paper, Topics: philosophy",
                           "Language: EN, Type: Checklist, Topics: scheduling, time-management"))  

The desired output would be a data.frame (or equivalent, like tibble) such as:

#>           title          id   tags  Language        Type Length                       Topics
#>           <chr>       <chr>  <chr>     <chr>       <chr>  <int>                        <chr>
#> 1        How To         001    ...        NA     Article    3.5               self help, DIY
#> 2        Why To         005    ...        NA       Paper     NA                   philosophy
#> 3       When To         102    ...        EN   Checklist     NA  scheduling, time-management

NB: I've used ... for shorthand representing the original string in dat

A tidyr approach would be preferable but given my stitching together of various solutions from similar problems has only advanced me a little, any solution would be helpful

CodePudding user response:

This seems to work on the sample data, but there's probably a much shorter version with some regex that distinguishes between the two uses of comma.

library(tidyverse)
dat %>%
  separate_rows(tags, sep = ", ") %>%
  separate(tags, into = c("header", "values"), fill = "left", sep = ": ") %>%
  fill(header, .direction = "down") %>%
  group_by(title, id, header) %>%
  summarize(values = paste(values, collapse = ", "), .groups = "drop") %>%
  pivot_wider(names_from = header, values_from = values)

Result

# A tibble: 3 × 6
  title   id    Length Topics                      Type      Language
  <chr>   <chr> <chr>  <chr>                       <chr>     <chr>   
1 How To  001   3.5    self help, DIY              Article   NA      
2 When To 102   NA     scheduling, time-management Checklist EN      
3 Why To  005   NA     philosophy                  Paper     NA 

CodePudding user response:

Expanding on the answer from Jon Spring but with a rexeg that distinguishes between the two uses of the comma:

library(dplyr)
library(tidyr)
dat %>% 
  separate_rows(tags, sep = "(, )(?=[[:alpha:]] :)") %>% 
  separate(tags, into = c("header", "value"), fill = "left", sep = ": ") %>% 
  pivot_wider(names_from = header, values_from = value)
#> # A tibble: 3 × 6
#>   title   id    Type      Length Topics                      Language
#>   <chr>   <chr> <chr>     <chr>  <chr>                       <chr>   
#> 1 How To  001   Article   3.5    self help, DIY              <NA>    
#> 2 Why To  005   Paper     <NA>   philosophy                  <NA>    
#> 3 When To 102   Checklist <NA>   scheduling, time-management EN

The regex identifies all , (comma with space) that are followed by one or more letters ([[:alpha:]] , means one or more), followed by a :.

If you want to keep the old tag, just add a line mutate(old_tag = tags) %>% before the line with separate_rows

  • Related