string_dat <- structure(list(ID = c(2455, 2455), Location = c("c(\"Southside of Dune\", \"The Hogwarts Express\")",
"Vertex, Inc.")), class = "data.frame", row.names = c(NA, -2L
))
> string_dat
ID Location
1 2455 c("Southside of Dune", "The Hogwarts Express")
2 2455 Vertex, Inc.
I would like to expand the data.frame above based on Location
.
library(tidyr)
> string_dat %>% tidyr::separate_rows(Location, sep = ",")
# A tibble: 4 × 2
ID Location
<dbl> <chr>
1 2455 "c(\"Southside of Dune\""
2 2455 " \"The Hogwarts Express\")"
3 2455 "Vertex"
4 2455 " Inc."
Splitting just on ,
wrongly split Vertex, Inc.
into two entries. Also it did not take care of c(\"
and \""
for the first two strings.
I also tried to remove the c(\"
at the beginning by using gsub
, but it gave me the following error.
> gsub('c(\"', "", x = string_dat$Location)
Error in gsub("c(\"", "", x = string_dat$Location) :
invalid regular expression 'c("', reason 'Missing ')''
My desired output is
# A tibble: 3 × 2
ID Location
<dbl> <chr>
1 2455 "Southside of Dune"
2 2455 "The Hogwarts Express"
3 2455 "Vertex, Inc."
********** Edit **********
library(tidyverse)
string_dat %>%
mutate(
# mark twin elements with `;`:
Location = str_replace(Location, '",', '";'),
# remove string-first `c` and all non-alphanumeric characters
# except `,`, `.`, and `;`:
Location = str_replace_all(Location, '^c|(?![.,; ])\\W', '')) %>%
separate_rows(Location, sep = '; ')
# A tibble: 3 × 2
ID Location
<dbl> <chr>
1 2455 "c(\"Southside of Dune\""
2 2455 "\"The Hogwarts Express\")"
3 2455 "Vertex, Inc."
CodePudding user response:
Here's an approach that combines data cleaning with separate_rows
:
library(tidyverse)
string_dat %>%
mutate(
# mark twin elements with `;`:
Location = str_replace(Location, '",', '";'),
# remove string-first `c` and all non-alphanumeric characters
# except `,`, `.`, and `;`:
Location = str_replace_all(Location, '^c|(?![.,; ])\\W', '')) %>%
separate_rows(Location, sep = '; ')
# A tibble: 3 × 2
ID Location
<dbl> <chr>
1 2455 Southside of Dune
2 2455 The Hogwarts Express
3 2455 Vertex, Inc.
How the regex ^c|(?![.,; ])\\W
works:
^c
: matches literal c at the beginning of the string|
: initiates alternation (i.e., "OR")(?![.,; ])\\W
: negative lookahead to assert that any non-alphanumeric characters (\\W
with upper-case "W") are matched except any of period, comma, and semi-colon (this exception from the\\W
character class is implemented by the lookahead)
CodePudding user response:
The Location
column has a strange data format. In the 1st element, it stores R code, because it's using the c("s1", "s2")
syntax for a two-element character vector. For the 2nd element, you're missing escaped quotes for this to be valid R code for a one-element character vector.
If I manually edit the 2nd element to add these quotation marks, then we can easily evaluate the R code contained in the Location
column, and then unnest the resulting list column. This might be easier than attempting to edit the strings programmatically?
library(tidyverse)
string_dat <- data.frame(
ID = c(2455, 2455),
Location = c("c(\"Southside of Dune\", \"The Hogwarts Express\")", "\"Vertex, Inc.\"")
)
string_dat %>%
rowwise() %>%
mutate(Location = list(eval(parse(text=Location)))) %>%
unnest(cols=Location)
#> # A tibble: 3 × 2
#> ID Location
#> <dbl> <chr>
#> 1 2455 Southside of Dune
#> 2 2455 The Hogwarts Express
#> 3 2455 Vertex, Inc.
Created on 2022-09-23 by the reprex package (v2.0.1)