I have strings like:
string <- "1, 2, \"something, else\""
I want to use tidyr::separate_rows() with sep = ",", but the comma inside the quoted portion of the string is tripping me up. I'd like to remove the comma between something and else (but only this comma).
Here's a more complex toy example:
string <- c("1, 2, \"something, else\"", "3, 5, \"more, more, more\"", "6, \"commas, are fun\", \"no, they are not\"")
string
#[1] "1, 2, \"something, else\""
#[2] "3, 5, \"more, more, more\""
#[3] "6, \"commas, are fun\", \"no, they are not\""
I want to get rid of all commas inside the embedded quotations. Desired output:
[1] "1, 2, \"something else\""
[2] "3, 5, \"more more more\""
[3] "6, \"commas are fun\", \"no they are not\""
CodePudding user response:
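You can define a small function to do the replacement and apply it to each quoted segment separately. Matching each segment with "[^"]*" rather than a greedy .* keeps the comma that separates two quoted segments in the same string.
library(stringr)
# strip every comma from a matched quoted segment
rmcom <- function(x) gsub(",", "", x)
# match each "..." segment on its own (needs a stringr version that accepts a function as replacement)
str_replace_all(string, "\"[^\"]*\"", rmcom)
[1] "1, 2, \"something else\""
[2] "3, 5, \"more more more\""
[3] "6, \"commas are fun\", \"no they are not\""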
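The same idea can also be written in base R, shown here as a minimal sketch rather than a definitive solution. It assumes quotes are balanced and never escaped inside a segment, and it treats every "..." segment as its own match, so commas between two quoted segments are left alone. The name cleaned is only illustrative.

```r
# Toy data from the question
string <- c("1, 2, \"something, else\"",
            "3, 5, \"more, more, more\"",
            "6, \"commas, are fun\", \"no, they are not\"")

# Locate every "..." segment: a quote, any non-quote characters, a quote
m <- gregexpr("\"[^\"]*\"", string)

# Replace each matched segment with a comma-free copy of itself
cleaned <- string
regmatches(cleaned, m) <- lapply(
  regmatches(cleaned, m),
  function(seg) gsub(",", "", seg, fixed = TRUE)
)

cleaned
```

Because only the text inside the matched segments is rewritten, the separators outside the quotes stay available for a later separate_rows() call.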
CodePudding user response:
Best I can do:
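stringr::str_replace_all(string, "(?<=\\\".{1,15})(,)(?=.*?\\\")", "")
Broken down:
(?<= ) = look behind
\\\" = the R-string spelling of the regex \", which matches a literal "
.{1,15} = between 1 and 15 characters (see note)
(,) = the comma is what we want to target
(?= ) = look ahead
.*? = zero or more characters, but as few as possible
\\\" = again a literal "
Note: a look-behind cannot be unbounded, so we can't use .*? there. Adjust the max of 15 for your dataset, but keep in mind that a bound that is too large lets the look-behind reach back across a closing quote and match a separator comma as well.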
edit: Andre Wildberg's solution is better - I stupidly forgot that the "" defining the string are not part of the string, so made it much more complex than it needed to be.
CodePudding user response:
Alternatively, we could invert the problem (and keep the commas, which might be useful) and use a regex directly with separate_rows to split only at the commas NOT inside quotes:
library(tidyr)
df |>
  separate_rows(stringcol, sep = '(?!\\B"[^\"]*), (?![^"]*\"\\B)')
Regex taken from: "Regex find comma not inside quotes"
See also: "Regex to pick characters outside of pair of quotes"
Output:
# A tibble: 9 × 1
stringcol
<chr>
1 "1"
2 "2"
3 "\"something, else\""
4 "3"
5 "5"
6 "\"more, more, more\""
7 "6"
8 "\"commas, are fun\""
9 "\"no, they are not\""
Data:
library(tibble)
df <- tibble(stringcol = string)
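For comparison, the "split only outside quotes" idea can be sketched in base R with strsplit(). Note that this uses a different regex from the one quoted above: it splits at a comma only when an even number of double quotes remains to its right. It assumes quotes are balanced and never escaped; the name pieces is only illustrative.

```r
# Toy data from the question
string <- c("1, 2, \"something, else\"",
            "3, 5, \"more, more, more\"",
            "6, \"commas, are fun\", \"no, they are not\"")

# Split at a comma (plus any following whitespace) only when the rest of
# the string holds an even number of quotes, i.e. the comma is outside quotes
pieces <- strsplit(
  string,
  ",\\s*(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)",
  perl = TRUE
)

# One flat character vector, one element per field
unlist(pieces)
```

The look-ahead counts quotes up to the end of the string, so a single unbalanced quote would change where the splits fall; it is worth checking the data for that before relying on this.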