I have a character string like this in R:
ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,
I would like to do something like str.split() to split it on every run of commas and quotation marks into an array of strings, but keep the commas inside the quoted dates, so that I get:
ABCDE
January 10, 2010
F
GH
March 9, 2009
Thanks
CodePudding user response:
This is one approach
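data.frame(list = na.omit(
  unname(unlist(read.csv(
    text = 'ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,',
    check.names = FALSE, header = FALSE,
    colClasses = "character",  # otherwise a bare F is read as logical FALSE
    na.strings = "")))))       # empty fields become NA so na.omit() drops them
list
1 ABCDE
2 January 10, 2010
3 F
4 GH
5 March 9, 2009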
CodePudding user response:
You should probably be using a CSV parser here, but if you wanted to use a pure regex approach you could try:
library(stringr)
library(dplyr)
x <- "ABCDE,\"January 10, 2010\",F,,,,GH,\"March 9, 2009\",,,"
y <- str_match_all(x, "\"(.*?)\"|[^,]+")[[1]]
output <- coalesce(y[,2], y[,1])
output
[1] "ABCDE" "January 10, 2010" "F" "GH"
[5] "March 9, 2009"
The regex pattern uses an alternation trick and says to match:
"(.*?)"    a date in quotes, capturing the text but not the quotes
|          OR
[^,]+      a single unquoted CSV term
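For anyone who would rather not load stringr and dplyr, the same alternation idea can be sketched in base R. This is only a sketch under the assumption that quoted fields never contain an escaped quote; the variable names m and fields are mine, not from the answer above.

```r
# Same input string as in the answer above.
x <- "ABCDE,\"January 10, 2010\",F,,,,GH,\"March 9, 2009\",,,"

# Match either a whole quoted field or a run of non-comma characters.
# perl = TRUE makes the first alternative win when both could match.
m <- regmatches(x, gregexpr("\"[^\"]*\"|[^,]+", x, perl = TRUE))[[1]]

# Strip the surrounding quotes that the quoted fields still carry.
fields <- gsub("^\"|\"$", "", m)
fields
```

Because the quotes are removed in a second step, no capture group or coalesce() is needed here.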
CodePudding user response:
If the pattern is as shown, a regex option is to create a delimiter and then use read.table:
read.table(text = gsub('"', '', gsub('("[^,"]+,)(*SKIP)(*FAIL)|,',
'\n', trimws(gsub(",{2,}", ",", str1), whitespace = ","), perl = TRUE)),
header = FALSE, fill = TRUE, sep = "\n")
-output
V1
1 ABCDE
2 January 10, 2010
3 F
4 GH
5 March 9, 2009
Or with scan
data.frame(V1 = setdiff(scan(text = str1, sep = ",",
what = character()), ""))
-output
V1
1 ABCDE
2 January 10, 2010
3 F
4 GH
5 March 9, 2009
data
str1 <- "ABCDE,\"January 10, 2010\",F,,,,GH,\"March 9, 2009\",,,"
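One thing to keep in mind with the scan version: setdiff() returns unique values, so a field that occurs twice in the input would be listed only once. If repeated fields should be kept, a possible variant (a sketch, with fields and kept as my own names) filters out the empty strings with nzchar() instead:

```r
str1 <- "ABCDE,\"January 10, 2010\",F,,,,GH,\"March 9, 2009\",,,"

# scan() honours the double quotes, so the commas inside the dates survive.
fields <- scan(text = str1, sep = ",", what = character(), quiet = TRUE)

# Drop only the empty fields; any repeated non-empty values are preserved.
kept <- fields[nzchar(fields)]
data.frame(V1 = kept)
```

For the example string both versions give the same five rows, since no field is repeated.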
CodePudding user response:
Another option could be:
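# colClasses keeps a bare F as text; na.strings turns empty fields into NA
na.omit(stack(read.csv(text = txt, header = FALSE,
  colClasses = "character", na.strings = "")))[1]
values
1 ABCDE
2 January 10, 2010
3 F
7 GH
8 March 9, 2009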
txt <- 'ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,'