I have a data.frame
that contains a character variable, which has a string of additional metadata (sort of a key-value format) that I'd like to have as variables in a data.frame
; some of these metadata have multiple values (an array) of different length, and not all observations have all of the additional data, and would therefore need to be empty or NA
A representative sample:
dat <- data.frame(title = c("How To", "Why To", "When To"),
id = c("001", "005", "102"),
tags = c("Type: Article, Length: 3.5, Topics: self help, DIY",
"Type: Paper, Topics: philosophy",
"Language: EN, Type: Checklist, Topics: scheduling, time-management"))
The desired output would be a data.frame
(or equivalent, like tibble
) such as:
#> title id tags Language Type Length Topics
#> <chr> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 How To 001 ... NA Article 3.5 self help, DIY
#> 2 Why To 005 ... NA Paper NA philosophy
#> 3 When To 102 ... EN Checklist NA scheduling, time-management
NB: I've used ...
for shorthand representing the original string in dat
A tidyr
approach would be preferable but given my stitching together of various solutions from similar problems has only advanced me a little, any solution would be helpful
CodePudding user response:
This seems to work on the sample data, but there's probably a much shorter version with some regex that distinguishes between the two uses of comma.
library(tidyverse)
dat %>%
separate_rows(tags, sep = ", ") %>%
separate(tags, into = c("header", "values"), fill = "left", sep = ": ") %>%
fill(header, .direction = "down") %>%
group_by(title, id, header) %>%
summarize(values = paste(values, collapse = ", "), .groups = "drop") %>%
pivot_wider(names_from = header, values_from = values)
Result
# A tibble: 3 × 6
title id Length Topics Type Language
<chr> <chr> <chr> <chr> <chr> <chr>
1 How To 001 3.5 self help, DIY Article NA
2 When To 102 NA scheduling, time-management Checklist EN
3 Why To 005 NA philosophy Paper NA
CodePudding user response:
Expanding on the answer from Jon Spring but with a rexeg that distinguishes between the two uses of the comma:
library(dplyr)
library(tidyr)
dat %>%
separate_rows(tags, sep = "(, )(?=[[:alpha:]] :)") %>%
separate(tags, into = c("header", "value"), fill = "left", sep = ": ") %>%
pivot_wider(names_from = header, values_from = value)
#> # A tibble: 3 × 6
#> title id Type Length Topics Language
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 How To 001 Article 3.5 self help, DIY <NA>
#> 2 Why To 005 Paper <NA> philosophy <NA>
#> 3 When To 102 Checklist <NA> scheduling, time-management EN
The regex identifies all ,
(comma with space) that are followed by one or more letters ([[:alpha:]]
,
means one or more), followed by a :
.
If you want to keep the old tag, just add a line mutate(old_tag = tags) %>%
before the line with separate_rows