R: split string by multi-character delimiter and keep the delimiter

I am trying to parse a fairly complicated string in R. This requires splitting the string on a multi-character delimiter while keeping parts of the delimiter on either side of the split.

To describe in words:

I have a long string made of multiple entries. Each entry begins with a number of varying length followed by "\t".
Each entry contains multiple paragraphs that I would also like to split. The end of a paragraph follows the pattern: character, period, character (without a space).
I would like to split at every entry, keeping the entry number at the beginning of the entry.
I would like to split at every paragraph, keeping the period at the end of the preceding paragraph.

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# desired output
[[1]] "1\tThis is a sentence. This is still part of the first paragraph."
[[2]] "This is now the second paragraph."
[[3]] "10\tThis is sentence number 1 of the tenth entry."
[[4]] "This is the second sentence now. Still the second paragraph."

I have found some answers here, but I have not been able to extend this to multi-character delimiters.

CodePudding user response：

Here's one possible approach.
\w in a regex is a word character: it matches a letter, digit or underscore. (\\w\\.)(\\w) searches for a "." between two word characters, and the parentheses divide the match into two groups that can be referenced later. "\\1###\\2" is the replacement pattern, where \1 and \2 refer to those two groups, so it inserts a dummy delimiter (###) at each point where the split should take place. We can then split on ### without removing any of the original content.

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
# the `_` placeholder for the native pipe requires R >= 4.2
input |> gsub("(\\w\\.)(\\w)", "\\1###\\2", x = _) |>
         strsplit("###", fixed = TRUE) |> unlist()
#> [1] "1\tThis is a sentence. This is still part of the first paragraph."
#> [2] "This is now the second paragraph."                                
#> [3] "10\tThis is sentence number 1 of the tenth entry."                
#> [4] "This is the second sentence now. Still the second paragraph."

Created on 2023-01-21 with reprex v2.0.2
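
The placeholder technique above silently breaks if the text already contains ###. As a minimal sketch (the function name split_paragraphs and its sep argument are hypothetical, not part of the original answer), the same two steps can be wrapped in a helper that refuses to run when the placeholder collides with the input. It assumes a single input string and a placeholder without backslashes, since backslashes have a special meaning in the gsub replacement.

```r
# Hypothetical helper, not from the original answer: wraps the
# placeholder technique and guards against placeholder collisions.
split_paragraphs <- function(x, sep = "###") {
  # Assumption: x is a single string (character vector of length 1).
  stopifnot(is.character(x), length(x) == 1L)
  if (grepl(sep, x, fixed = TRUE)) {
    stop("placeholder already occurs in the input; choose another `sep`")
  }
  # Insert the placeholder between "<word char>." and the next word char.
  marked <- gsub("(\\w\\.)(\\w)", paste0("\\1", sep, "\\2"), x)
  # Split on the literal placeholder; nothing from the input is dropped.
  strsplit(marked, sep, fixed = TRUE)[[1]]
}

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
parts <- split_paragraphs(input)
```

Because only the placeholder is removed by the split, pasting the pieces back together with paste(parts, collapse = "") should reproduce the original string, which makes for a convenient sanity check.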

CodePudding user response：

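Using strsplit with a zero-width lookbehind: the pattern matches the position right after a period that is itself followed by a word character (checked with a lookahead). Because the match consumes no characters, nothing is removed by the split.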

strsplit(input, '(?<=(\\.(?=\\w)))', perl=TRUE) |> unlist()
# [1] "1\tThis is a sentence. This is still part of the first paragraph."
# [2] "This is now the second paragraph."                                
# [3] "10\tThis is sentence number 1 of the tenth entry."                
# [4] "This is the second sentence now. Still the second paragraph."
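
The pattern above splits after any period that is followed by a word character, whatever comes before the period. If the split should follow the question's rule more literally (character, period, character), one possible variant, offered here as a sketch rather than as part of the original answer, pairs a lookbehind for a word character plus period with a separate lookahead for a word character. The variable name parts is only illustrative.

```r
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# Zero-width split point: preceded by "<word char>." and followed by a
# word char. Lookarounds need the PCRE engine, hence perl = TRUE.
parts <- strsplit(input, "(?<=\\w\\.)(?=\\w)", perl = TRUE)[[1]]

# Sanity checks: four pieces, and no characters lost in the split.
length(parts)
identical(paste(parts, collapse = ""), input)
```

On this input both patterns should give the same four pieces; they would differ only on text such as "(see above).Next", where the period follows a non-word character.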

Data:

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."