I am trying to parse a fairly complicated string in R. This requires splitting the string on a multi-character pattern while keeping parts of the delimiter on either side of the split.
To describe in words:
- I have a long string made of multiple entries. Each entry begins with a number of varying length followed by "\t".
- Each entry contains multiple paragraphs that I would also like to split. The end of a paragraph follows the pattern: character, period, character (with no space after the period).
- I would like to split every entry, keeping the entry number at the beginning of the entry
- I would like to split every paragraph, keeping the period at the end of the preceding paragraph
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
# desired output
[[1]] "1\tThis is a sentence. This is still part of the first paragraph."
[[2]] "This is now the second paragraph."
[[3]] "10\tThis is sentence number 1 of the tenth entry."
[[4]] "This is the second sentence now. Still the second paragraph."
I have found some answers here, but I have not been able to extend this to multi-character delimiters.
CodePudding user response:
Here's one possible approach.
\w in regex is a word character: it matches a letter, digit, or underscore. (\\w\\.)(\\w) searches for a "." between two word characters, and the parentheses divide the match into two groups that can be referenced later. "\\1###\\2" is the replacement pattern, where \1 and \2 refer to the groups from the match.
So it adds a dummy delimiter where the split should take place, and then we can split by ### without removing any of the original content.
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
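# the `x = _` placeholder for the native pipe needs R >= 4.2
input |>
  gsub("(\\w\\.)(\\w)", "\\1###\\2", x = _) |>  # mark each split point with ###
  strsplit("###", fixed = TRUE) |>               # split on the literal marker
  unlist()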
#> [1] "1\tThis is a sentence. This is still part of the first paragraph."
#> [2] "This is now the second paragraph."
#> [3] "10\tThis is sentence number 1 of the tenth entry."
#> [4] "This is the second sentence now. Still the second paragraph."
Created on 2023-01-21 with reprex v2.0.2
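In case it helps to see the two steps separately, here is a minimal sketch of the same idea without the pipe (so it also runs on R versions older than 4.2). The names marked and parts are only for illustration, and it assumes the marker ### never occurs in the real text.

```r
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# Step 1: insert a marker between "<word char>." and the next word character.
# Both captured groups are written back, so nothing from the input is lost.
marked <- gsub("(\\w\\.)(\\w)", "\\1###\\2", input)

# Step 2: split on the literal marker (fixed = TRUE turns off regex matching).
parts <- strsplit(marked, "###", fixed = TRUE)[[1]]

print(marked)
print(parts)
```

One caveat: because the second group consumes the character after the period, back-to-back split points such as "a.b.c" would only be marked once per pass. That does not come up in the sample input.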
CodePudding user response:
Using strsplit, but with a lookbehind that wraps a lookahead, so the split happens right after a period that is followed by a word character.
strsplit(input, '(?<=(\\.(?=\\w)))', perl=TRUE) |> unlist()
# [1] "1\tThis is a sentence. This is still part of the first paragraph."
# [2] "This is now the second paragraph."
# [3] "10\tThis is sentence number 1 of the tenth entry."
# [4] "This is the second sentence now. Still the second paragraph."
Data:
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
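A possible variation on the lookaround idea, offered as a sketch rather than a drop-in replacement: putting the lookbehind and the lookahead side by side, with no capture group, and also requiring a word character before the period, which is closer to the "character, period, character" rule in the question. The pattern and the name parts are my own choices, not from the answers above.

```r
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# Split at the empty position that is
#   - preceded by a word character plus a period: (?<=\\w\\.)
#   - followed by a word character:               (?=\\w)
# Both assertions are zero-width, so no characters are removed by the split.
parts <- strsplit(input, "(?<=\\w\\.)(?=\\w)", perl = TRUE)[[1]]

print(parts)
```

Since nothing is consumed, pasting the pieces back together with paste(parts, collapse = "") should reproduce the original input, which makes for a quick sanity check on real data.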