Somehow I am not able to find anything online about how to set a pattern ending to a double \n. My particular case is the following. I have this string:
"1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said \nby Matt.\n\n2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas"
And I would like to extract only the text between digit\n and \n\n. So, in my case, I'd like to have
This is said \nby Matt.
While this is said by Lucas
Although I am not very skilled with RegEx, I tried many combinations such as (?<=\d\n).*?(?=\n\n), (?<=\d\n).\n\n and (?<=\d\n).*?(?=\r\n\r\n), but without any luck.
I have tried those as well as others with R's stringr library, but also with Python's re.
The issue first came up in this answer: https://stackoverflow.com/a/72547966/19284124
CodePudding user response:
You can make the . match across lines with the (?s) inline modifier and extend the double newline pattern to alternatively match the end of string:
(?s)(?<=\d\n).*?(?=\n\n|\Z)
See the regex demo.
Details:
(?s) - a flag allowing . to match line break chars
(?<=\d\n) - a positive lookbehind that matches a location that is immediately preceded by a digit and a newline
.*? - any zero or more chars, as few as possible
(?=\n\n|\Z) - a positive lookahead that matches a location that is immediately followed by two newline chars or the end of string.
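For reference, here is a minimal sketch of how this pattern could be used with Python's re, run on the sample string from the question. The variable names (subtitles, pattern, texts) are only illustrative and do not come from the question.

```python
import re

# Sample string from the question.
subtitles = (
    "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said \nby Matt.\n\n"
    "2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas"
)

# (?s) lets . match newlines; \Z covers the last block,
# which is not followed by a blank line.
pattern = re.compile(r"(?s)(?<=\d\n).*?(?=\n\n|\Z)")
texts = pattern.findall(subtitles)

for text in texts:
    print(repr(text))
# 'This is said \nby Matt.'
# 'While this is said by Lucas'
```

Without (?s), the . cannot cross the newline inside the first block, and without the \Z alternative the last block is never matched because no \n\n follows it.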
CodePudding user response:
This regex is more efficient and is a variant that works in many regex flavors, such as JavaScript, PHP, Python, Java, .NET, etc., because it avoids (?s) and \Z or \z:
(?<=\d\n)(?:.*\n)*?.*(?=\n\n|$)
Make sure to use it without MULTILINE mode.
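As a rough illustration in Python (the variable names are made up for this sketch), the following runs the pattern on the question's sample string, once as intended and once with re.MULTILINE, to show why that flag should be left off.

```python
import re

# Sample string from the question.
subtitles = (
    "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said \nby Matt.\n\n"
    "2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas"
)

portable = r"(?<=\d\n)(?:.*\n)*?.*(?=\n\n|$)"

# Default mode: $ only matches at the very end of the string
# (or just before a final trailing newline).
texts = re.findall(portable, subtitles)
print(texts)

# MULTILINE mode: $ also matches at every line end, so the first
# block is cut off after its first line.
with_multiline = re.findall(portable, subtitles, flags=re.MULTILINE)
print(with_multiline)
```

In default mode this returns both full blocks. With re.MULTILINE the first result shrinks to 'This is said ', because the lookahead is already satisfied at the end of that line.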