How to split string of characters by commas but keep dates?

I have a string of characters like this in R

ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,

I would like to do something like str.split() to split it on the commas into an array of strings, but keep the commas inside the quotation marks (the dates) and drop the empty fields, so that I get:

ABCDE
January 10, 2010
F
GH
March 9, 2009

Thanks

CodePudding user response：

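This is one approach. Reading every column as character (colClasses = "character") keeps the bare F from being parsed as the logical FALSE, and na.strings = "" turns the empty fields into NA so that na.omit() can drop them:

data.frame(list = na.omit(
  unname(unlist(read.csv(
    text = 'ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,',
    header = FALSE, check.names = FALSE,
    colClasses = "character", na.strings = "")))))
              list
1            ABCDE
2 January 10, 2010
3                F
4               GH
5    March 9, 2009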

CodePudding user response：

You should probably be using a CSV parser here, but if you wanted to use a pure regex approach you could try:

library(stringr)
library(dplyr)

x <- "ABCDE,\"January 10, 2010\",F,,,,GH,\"March 9, 2009\",,,"
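y <- str_match_all(x, "\"(.*?)\"|[^,]+")[[1]]
output <- coalesce(y[,2], y[,1])
output

[1] "ABCDE"            "January 10, 2010" "F"                "GH"
[5] "March 9, 2009"

The regex pattern uses an alternation trick and says to match:

"(.*?)" match a quoted term (here, a date), capturing its content without the quotes
| OR
[^,]+ match a single unquoted CSV term

For quoted terms the capture group in y[,2] holds the text without quotes; for unquoted terms it is NA, so coalesce() falls back to the full match in y[,1].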
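
The same alternation idea can also be sketched in base R alone, without stringr or dplyr. This is only a minimal sketch and not part of the answer above; the names x, m and fields are illustrative, and perl = TRUE is an assumption made so that the quoted alternative is tried first:

```r
x <- "ABCDE,\"January 10, 2010\",F,,,,GH,\"March 9, 2009\",,,"

# Match either a whole quoted term or a run of non-comma characters.
m <- regmatches(x, gregexpr("\"[^\"]*\"|[^,]+", x, perl = TRUE))[[1]]

# Strip the surrounding quotes from the quoted terms.
fields <- gsub("^\"|\"$", "", m)
fields
```

Empty fields never match either alternative, so they are dropped without a separate filtering step.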

CodePudding user response：

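If the pattern is as shown, then a regex option would be to create a delimiter and make use of read.table. The repeated commas are collapsed, the trailing comma is trimmed, and every comma that is not inside a quoted date is replaced with a newline:

read.table(text = gsub('"', '', gsub('("[^,"]+,)(*SKIP)(*FAIL)|,',
    '\n', trimws(gsub(",{2,}", ",", str1), whitespace = ","), perl = TRUE)),
    header = FALSE, fill = TRUE, sep = "\n")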

-output

                V1
1            ABCDE
2 January 10, 2010
3                F
4               GH
5    March 9, 2009

Or with scan

data.frame(V1 = setdiff(scan(text = str1, sep = ",",
    what = character()), ""))

-output

              V1
1            ABCDE
2 January 10, 2010
3                F
4               GH
5    March 9, 2009
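
One caveat with setdiff() is that it is a set operation, so it also removes repeated fields. A minimal sketch, assuming a made-up input dup that contains a repeated field, which drops only the empty strings instead:

```r
# Hypothetical input with a repeated field ("A") and an empty field.
dup <- 'A,"January 10, 2010",A,,B'

fields <- scan(text = dup, sep = ",", what = character(), quiet = TRUE)

# setdiff() treats its arguments as sets, so the second "A" is lost.
deduped <- setdiff(fields, "")

# Filtering on nzchar() removes only the empty strings and keeps duplicates.
kept <- fields[nzchar(fields)]
kept
```

For the string in the question both variants give the same result, since none of its non-empty fields repeat.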

data

str1 <- "ABCDE,\"January 10, 2010\",F,,,,GH,\"March 9, 2009\",,,"

CodePudding user response：

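Another option could be the following. Note that read.csv() guesses the column types, so the bare F is parsed as a logical and appears as FALSE in the output:

na.omit(stack(read.csv(text = txt, header = FALSE)))[1]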

            values
1            ABCDE
2 January 10, 2010
3            FALSE
4               GH
5    March 9, 2009

txt <- 'ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,'