I'm trying to split the following string
str = "A (B) C, D (E) F, G, H, a (b) c"
into 9 separate strings like: A, B, C, D, E, {F, G, H}, a, b, c
I've tried:
library(stringi)
str = "A (B) C, D (E) F, G, H, a (b) c"
strr = stri_split_regex(str, "\\(.*?\\)")
strr
which returns strr as A, {C, D}, {F, G, H, a}, c
The actual string I'm working with looks like
str2 = "Independent Spirit Award (Co-Nominee) for Anomalisa, Academy Award (Co-Nominee) for Anomalisa, Independent Spirit Award (Co-Winner) for Synecdoche, New York, Independent Spirit Award (Nominee) for Synecdoche, New York"
and I want that to be separated into
Independent Spirit Award; Co-Nominee; for Anomalisa; Academy Award; Co-Nominee; for Anomalisa; Independent Spirit Award; Co-Winner; for Synecdoche, New York; Independent Spirit Award; Nominee; for Synecdoche, New York;
So I think what I need is to split the string at the brackets while keeping the text both inside and outside of them. The tricky part is that the commas are placed irregularly: I only want to split at the last comma before each '(', so that the text after that comma starts a new element and any earlier commas stay inside their element.
CodePudding user response:
This pattern splits on an open or close paren, or on the last comma before a parenthesized group (a comma with no further comma before the next close paren), and also consumes any adjacent whitespace.
For str:
library(stringi)
stri_split_regex(str, "\\s*(\\(|\\)|,(?=[^,]+\\)))\\s*")
[[1]]
[1] "A" "B" "C" "D" "E" "F, G, H" "a"
[8] "b" "c"
For str2:
stri_split_regex(str2, "\\s*(\\(|\\)|,(?=[^,]+\\)))\\s*")
[[1]]
[1] "Independent Spirit Award" "Co-Nominee"
[3] "for Anomalisa" "Academy Award"
[5] "Co-Nominee" "for Anomalisa"
[7] "Independent Spirit Award" "Co-Winner"
[9] "for Synecdoche, New York" "Independent Spirit Award"
[11] "Nominee" "for Synecdoche, New York"
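If stringi is not available, the same idea can be sketched in base R. This is only an illustration, not part of the answer above: it assumes strsplit with perl = TRUE is an acceptable substitute for stri_split_regex, and the names paren, split_comma and split_awards are hypothetical helpers introduced here to show how the pattern is assembled.

```r
# Base R sketch: build the same pattern from named pieces.
paren       <- "\\(|\\)"         # an open or a close paren
split_comma <- ",(?=[^,]+\\))"   # a comma with no further comma before the next ")"
pattern     <- paste0("\\s*(", paren, "|", split_comma, ")\\s*")

# Hypothetical helper: split one string and return a character vector.
split_awards <- function(x) {
  # perl = TRUE is needed because the lookahead (?=...) is PCRE syntax
  strsplit(x, pattern, perl = TRUE)[[1]]
}

s <- "A (B) C, D (E) F, G, H, a (b) c"
parts <- split_awards(s)
print(parts)

# Join with "; " to get the layout asked for in the question
print(paste(parts, collapse = "; "))
```

One difference to keep in mind: strsplit drops an empty string after a match at the very end of the input, so results may differ from stri_split_regex when a string ends with a paren.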