How to split strings based on backslash with regex in R-CodePudding

string_dat <- structure(list(ID = c(2455, 2455), Location = c("c(\"Southside of Dune\", \"The Hogwarts Express\")", 
"Vertex, Inc.")), class = "data.frame", row.names = c(NA, -2L
))

> string_dat
    ID                                       Location
1 2455 c("Southside of Dune", "The Hogwarts Express")
2 2455                                   Vertex, Inc.

I would like to expand the data.frame above based on Location.

library(tidyr)
> string_dat %>% tidyr::separate_rows(Location, sep = ",")
# A tibble: 4 × 2
     ID Location                    
  <dbl> <chr>                       
1  2455 "c(\"Southside of Dune\""   
2  2455 " \"The Hogwarts Express\")"
3  2455 "Vertex"                    
4  2455 " Inc."

Splitting just on , wrongly split Vertex, Inc. into two entries. Also it did not take care of c(\" and \"" for the first two strings.

I also tried to remove the c(\" at the beginning by using gsub, but it gave me the following error.

> gsub('c(\"', "", x = string_dat$Location)
Error in gsub("c(\"", "", x = string_dat$Location) : 
  invalid regular expression 'c("', reason 'Missing ')''

My desired output is

# A tibble: 3 × 2
     ID Location                    
  <dbl> <chr>                       
1  2455 "Southside of Dune"   
2  2455 "The Hogwarts Express"
3  2455 "Vertex, Inc."

********** Edit **********

library(tidyverse)
string_dat %>% 
  mutate(
    # mark twin elements with `;`:
    Location = str_replace(Location, '",', '";'),
    # remove string-first `c` and all non-alphanumeric characters
    # except `,`, `.`, and `;`:
    Location = str_replace_all(Location, '^c|(?![.,; ])\\W', '')) %>%
  separate_rows(Location, sep = '; ')

# A tibble: 3 × 2
     ID Location                   
  <dbl> <chr>                      
1  2455 "c(\"Southside of Dune\""  
2  2455 "\"The Hogwarts Express\")"
3  2455 "Vertex, Inc."

CodePudding user response：

Here's an approach that combines data cleaning with separate_rows:

library(tidyverse)
string_dat %>% 
  mutate(
    # mark twin elements with `;`:
    Location = str_replace(Location, '",', '";'),
    # remove string-first `c` and all non-alphanumeric characters
    # except `,`, `.`, and `;`:
    Location = str_replace_all(Location, '^c|(?![.,; ])\\W', '')) %>%
  separate_rows(Location, sep = '; ')
# A tibble: 3 × 2
     ID Location            
  <dbl> <chr>               
1  2455 Southside of Dune   
2  2455 The Hogwarts Express
3  2455 Vertex, Inc.

How the regex ^c|(?![.,; ])\\W works:

^c: matches literal c at the beginning of the string
|: initiates alternation (i.e., "OR")
(?![.,; ])\\W: negative lookahead to assert that any non-alphanumeric characters (\\W with upper-case "W") are matched except any of period, comma, and semi-colon (this exception from the \\W character class is implemented by the lookahead)

CodePudding user response：

The Location column has a strange data format. In the 1st element, it stores R code, because it's using the c("s1", "s2") syntax for a two-element character vector. For the 2nd element, you're missing escaped quotes for this to be valid R code for a one-element character vector.

If I manually edit the 2nd element to add these quotation marks, then we can easily evaluate the R code contained in the Location column, and then unnest the resulting list column. This might be easier than attempting to edit the strings programmatically?

library(tidyverse)

string_dat <- data.frame(
  ID = c(2455, 2455), 
  Location = c("c(\"Southside of Dune\", \"The Hogwarts Express\")", "\"Vertex, Inc.\"")
)

string_dat %>%
  rowwise() %>%
  mutate(Location = list(eval(parse(text=Location)))) %>%
  unnest(cols=Location)
#> # A tibble: 3 × 2
#>      ID Location            
#>   <dbl> <chr>               
#> 1  2455 Southside of Dune   
#> 2  2455 The Hogwarts Express
#> 3  2455 Vertex, Inc.

^{Created on 2022-09-23 by the reprex package (v2.0.1)}