Remove comma inside quotes

I have strings like:

string <- "1, 2, \"something, else\""

I want to use tidyr::separate_rows() with sep = ",", but the comma inside the quoted portion of the string is tripping me up. I'd like to remove the comma between something and else (but only this comma).

Here's a more complex toy example:

string <- c("1, 2, \"something, else\"", "3, 5, \"more, more, more\"", "6, \"commas, are fun\", \"no, they are not\"")

string
#[1] "1, 2, \"something, else\""                   
#[2] "3, 5, \"more, more, more\""                  
#[3] "6, \"commas, are fun\", \"no, they are not\""

I want to get rid of all commas inside the embedded quotations. Desired output:

[1] "1, 2, \"something else\""                  
[2] "3, 5, \"more more more\""                  
[3] "6, \"commas are fun\", \"no they are not\""

CodePudding user response:

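You can define a small function that strips commas and apply it to each quoted segment.

library(stringr)

# Remove every comma from the matched text
rmcom <- function(x) gsub(",", "", x, fixed = TRUE)

# Match each quoted segment on its own ("[^"]*"), so commas between segments are kept
str_replace_all(string, "\"[^\"]*\"", rmcom)
[1] "1, 2, \"something else\""
[2] "3, 5, \"more more more\""
[3] "6, \"commas are fun\", \"no they are not\""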
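For readers without stringr, the same idea (edit only the text inside each pair of quotes) can be sketched in base R with gregexpr() and regmatches<-. This is a minimal sketch, assuming the quotes are always balanced and never escaped inside a field; the helper name strip_quoted_commas is made up for illustration.

```r
string <- c("1, 2, \"something, else\"",
            "3, 5, \"more, more, more\"",
            "6, \"commas, are fun\", \"no, they are not\"")

# Hypothetical helper: remove commas only inside "..." segments
strip_quoted_commas <- function(x) {
  # Locate every quoted segment in every element of x
  m <- gregexpr("\"[^\"]*\"", x)
  # Drop the commas from each matched segment and write it back in place
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(seg) gsub(",", "", seg, fixed = TRUE))
  x
}

strip_quoted_commas(string)
```

Commas outside the quoted segments are never touched, so the result can still be split on "," afterwards.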

CodePudding user response:

Best I can do:

stringr::str_replace_all(string, "(?<=\\\".{1,15})(,)(?=.*?\\\")", "")

How it works:

(?<= ) = lookbehind

\\\" = in the R string, a \ and a "; the regex engine reads \" as a literal "

.{1,15} = between 1 and 15 characters (see note)

(,) = the comma, which is what we want to target

(?= ) = lookahead

.*? = zero or more characters, but as few as possible

\\\" = again a \ and a ", matching a literal "

Note: a lookbehind cannot be unbounded, so we can't use .*? there. Adjust the maximum of 15 for your dataset.

Edit: Andre Wildberg's solution is better. I stupidly forgot that the "" defining the string are not part of the string, so I made it much more complex than it needed to be.
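To see how much the 15-character bound matters, here is a small check. It assumes stringr is installed and uses a pattern of the same shape as the one above; the inputs and the wrapper name rm_comma are invented for illustration. A quoted field with more than 15 characters before its comma should be left untouched until the bound is raised. As far as I can tell, a comma that follows a short quoted field is removed as well, because the lookbehind only checks that a quote occurs within the bound, not that the comma sits inside a pair of quotes.

```r
library(stringr)

# Illustrative wrapper: same shape of pattern, with the lookbehind bound as a parameter
rm_comma <- function(x, max_len = 15) {
  pattern <- paste0("(?<=\".{1,", max_len, "}),(?=.*?\")")
  str_replace_all(x, pattern, "")
}

long_field  <- "a, \"this is a rather long quoted, field\""
short_field <- "6, \"a, b\", \"c\""

rm_comma(long_field)                # comma sits 28 characters after the opening quote
rm_comma(long_field, max_len = 40)  # a larger bound reaches the opening quote
rm_comma(short_field)               # also affects the comma between the two fields
```

This is why matching each quoted segment as a whole tends to be more robust than a bounded lookbehind.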

CodePudding user response:

Alternatively, we could invert the problem (and keep the commas, which might be useful) and pass a regex directly to separate_rows() so that it splits only at commas that are NOT inside quotes:

library(tidyr)

df |>
  separate_rows(stringcol, sep = '(?!\\B"[^\"]*), (?![^"]*\"\\B)')

Regex expression from: Regex find comma not inside quotes

Alternatively: Regex to pick characters outside of pair of quotes

Output:

# A tibble: 9 × 1
  stringcol             
  <chr>                 
1 "1"                   
2 "2"                   
3 "\"something, else\"" 
4 "3"                   
5 "5"                   
6 "\"more, more, more\""
7 "6"                   
8 "\"commas, are fun\"" 
9 "\"no, they are not\""

Data:

library(tibble)

df <- tibble(stringcol = string)
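
The same split can also be sketched without tidyr, using strsplit(). This assumes the pattern behaves the same under PCRE (perl = TRUE) as it does in separate_rows(), which should hold here because it relies only on lookaheads and \B; the vector string is the one from the question.

```r
string <- c("1, 2, \"something, else\"",
            "3, 5, \"more, more, more\"",
            "6, \"commas, are fun\", \"no, they are not\"")

# Split at ", " unless the next quote is a closing one
# (a closing quote is followed by a non-word character or the end of the string)
pattern <- "(?!\\B\"[^\"]*), (?![^\"]*\"\\B)"
parts <- strsplit(string, pattern, perl = TRUE)

# One row per piece, similar in spirit to separate_rows()
data.frame(stringcol = unlist(parts))
```

Unlike separate_rows(), this drops any other columns, so it only fits when the string column is all you need.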