I have a text from which I want to extract the first two paragraphs. The text consists of several paragraphs separated by empty lines, and the paragraphs themselves can contain line breaks. What I want to extract is everything from the beginning of the text up to the second empty line. This is the original text:
Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

Support the GoFundMe: http://gofundme.com/f/send-money-dire...
Follow Me:
The text I want to have is:
Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd
I tried to write a regular expression for the job, and I thought the following might be a possible solution:
(.*|\n)*(?:[[:blank:]]*\n){2,}(.*|\n)*(?:[[:blank:]]*\n){2,}
When I use it in R in stri_extract_all_regex, I receive the following error:
Error in stri_extract_all_regex(video_desc_orig, "(.*|\n)*?(?:[[:blank:]]*\n){2,}(.*?|\n)*(?:[[:blank:]]*\n){2,}") :
Regular expression backtrack stack overflow. (U_REGEX_STACK_OVERFLOW)
It's my first time using regex and I really don't know how to interpret this error. Any help is appreciated ;)
CodePudding user response:
In R you need to double the backslashes (\\) inside the string.
string <- 'Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

Support the GoFundMe: http://gofundme.com/f/send-money-dire...
Follow Me: '
library(stringr)
string |>
  str_extract('(.*|\\n)*(?:[[:blank:]]*\\n){2,}(.*|\\n)*(?:[[:blank:]]*\\n){2,}') |>
  cat()
# Output
Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd
CodePudding user response:
You have nested quantifiers like (.*|\n)*, which create a lot of paths to explore. This pattern, for example, first matches all of the text and then starts backtracking to fit the remaining parts of the pattern.
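To illustrate the point, here is a minimal sketch of one way to avoid the nested quantifier: the alternation `(.*|\n)*` is replaced by a single lazy `.*?` with the dotall flag `(?s)`, so each segment has only one quantifier. It uses base R with `perl = TRUE` instead of stringi, and `txt` is a shortened stand-in for the real description; both are assumptions made for illustration.

```r
# Shortened stand-in for the video description (paragraphs separated by blank lines).
txt <- "Line one of paragraph 1.\nLine two of paragraph 1.\n\nParagraph 2.\n\nParagraph 3.\nFollow Me: "

# (?s) lets '.' match newlines, so one lazy '.*?' replaces '(.*|\n)*'.
# \n[^\S\n]*\n is a line break followed by an empty or whitespace-only line.
# The lookahead stops in front of the second blank line without consuming it.
pattern <- "(?s)\\A.*?\\n[^\\S\\n]*\\n.*?(?=\\n[^\\S\\n]*\\n)"

first_two <- regmatches(txt, regexpr(pattern, txt, perl = TRUE))
cat(first_two)
```

Because the lazy quantifier grows one character at a time and stops at the first blank line it meets, the engine does not have to swallow the whole text and backtrack from the end.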
The following pattern matches the first two paragraphs including the two trailing newlines, and requires every line to contain at least one non-whitespace character:
library(stringi)
string <- 'Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

Support the GoFundMe: http://gofundme.com/f/send-money-dire...
Follow Me: '
stri_extract_all_regex(
  string,
  '^(?:[^\\S\\n]*\\S.*(?:\\n[^\\S\\n]*\\S.*)*\\n\\n){2}'
)
Output
[[1]]
[1] "Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.\nThen I went to a nice restaurant with them.\n\nBuy me a Beer: https://www.buymeacoffee.com/johnnyfd\n\n"
If you don't want to match the last two newlines, you can assert them with a lookahead instead of consuming them:
^[^\\S\\n]*\\S.*(?:\\n[^\\S\\n]*\\S.*)*\\n\\n[^\\S\\n]*\\S.*(?:\\n[^\\S\\n]*\\S.*)*(?=\\n\\n)
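As a rough sketch of how the lookahead pattern above could be applied, the following runs it with base R's `regexpr(..., perl = TRUE)` so the example needs no extra packages. That choice is an assumption made for illustration; the pattern only uses constructs that already appear in the `stri_extract_all_regex` call earlier in this answer, so it should work there as well.

```r
# Sample text from the question, with blank lines between paragraphs.
string <- paste(
  "Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.",
  "Then I went to a nice restaurant with them.",
  "",
  "Buy me a Beer: https://www.buymeacoffee.com/johnnyfd",
  "",
  "Support the GoFundMe: http://gofundme.com/f/send-money-dire...",
  "Follow Me: ",
  sep = "\n"
)

# Two paragraphs of non-blank lines separated by one empty line.
# The final (?=\n\n) only checks that a blank line follows; it is not part of the match.
pattern <- "^[^\\S\\n]*\\S.*(?:\\n[^\\S\\n]*\\S.*)*\\n\\n[^\\S\\n]*\\S.*(?:\\n[^\\S\\n]*\\S.*)*(?=\\n\\n)"

result <- regmatches(string, regexpr(pattern, string, perl = TRUE))
cat(result)
```

The extracted string ends right after the second paragraph, so no trailing newlines need to be trimmed afterwards.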