For example, I have something like this:
author - xxx
title - xxxx
abstract - xxxx
date - xxxx
author - xxx
title - xxxx
abstract - xxxx,
xxxxxxxx
xxxxxxxx
date - xxxx
I want to capture everything between "author" and "date", once for every occurrence of this pattern (no nesting). So the desired output looks like:
["\r\nauthor...date xxx", "\r\nauthor...date xxx" ]
The difficulty is that there may be an arbitrary number of lines between "author" and "date", and I also have to take care of the line breaks.
When I use \r\nauthor.*\n.*\n.*\n.*date.*, it captures every block that is exactly four lines long from "author" to "date".
But when I try to handle an arbitrary number of lines with \r\nauthor((.|\r\n)*?).*date.*, the result is not what I expect.
Can anyone suggest an expression I can use for this task? Thanks!
CodePudding user response:
You can use re.findall with the flags re.M (multi-line) and re.S (dotall). That way . won't stop at newlines, and ^/$ will match at the beginning and end of each line (regex101):
import re
text = """author - xxx
title - xxxx
abstract - xxxx
date - xxxx
author - xxx
title - xxxx
abstract - xxxx,
xxxxxxxx
xxxxxxxx
date - xxxx"""
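# Lazily match from a line starting with "author" to the end of the next line starting with "date".
for group in re.findall(r"^author.*?^date.*?$", text, flags=re.M | re.S):
    print(group)
    print("-" * 80)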
Prints:
author - xxx
title - xxxx
abstract - xxxx
date - xxxx
--------------------------------------------------------------------------------
author - xxx
title - xxxx
abstract - xxxx,
xxxxxxxx
xxxxxxxx
date - xxxx
--------------------------------------------------------------------------------
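Since the question mentions \r\n line endings and asks for a list, here is a minimal sketch of a variant for that case. It assumes the input really uses Windows-style line endings; the sample text and the name BLOCK are made up for illustration. The date line is matched with [^\r\n]* instead of .*?$, because with re.S the dot would also consume the trailing \r before $ can match.

```python
import re

# Hypothetical sample with Windows-style (\r\n) line endings.
text = (
    "author - a\r\ntitle - t\r\nabstract - x\r\ndate - d1\r\n"
    "author - b\r\ntitle - u\r\nabstract - y,\r\nmore\r\nmore\r\ndate - d2\r\n"
)

# ^ (with re.M) anchors "author" and "date" at line starts; .*? (with re.S)
# lazily spans the lines in between; [^\r\n]* takes the rest of the date line
# without swallowing the line terminator.
BLOCK = re.compile(r"^author.*?^date[^\r\n]*", flags=re.M | re.S)

blocks = BLOCK.findall(text)
print(blocks)
```

Note that a continuation line that happens to start with "date" would end a block early, so this only fits data shaped like the sample.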
CodePudding user response:
I revised my old answer because it apparently wasn't the correct answer to your question. My bad, I didn't read it correctly. The following should work for you:
author[\w\W]+?date[^\r\n]+
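A quick way to try this approach in Python is sketched below. It assumes the intended pattern is author[\w\W]+?date[^\r\n]+ (a lazy +? after the character class and a + after the negated class); the sample text is hypothetical. Because that pattern has no anchors, a word such as "update" can end the match early, so the sketch also shows an anchored variant using re.M.

```python
import re

# Hypothetical sample: "update" in the title contains the substring "date".
text = "author - a\ntitle - update notes\ndate - 2020\n"

# Assumed reading of the pattern, with no anchors and no flags.
# [\w\W] matches any character, including newlines.
unanchored = re.findall(r"author[\w\W]+?date[^\r\n]+", text)

# Same idea, but "author" and "date" must each start a line (re.M).
anchored = re.findall(r"^author[\w\W]+?^date[^\r\n]+", text, flags=re.M)

print(unanchored)  # stops at the "date" inside "update"
print(anchored)    # runs through to the real date line
```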
CodePudding user response:
I imagine there must be a cleaner or more efficient regex, but this pattern works with the global and multiline flags:
^author(.|\s)*?date.*$
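If this pattern is used from Python, one thing to watch is that re.findall returns the contents of the capturing group rather than the whole match. The sketch below, on a made-up sample, shows two ways around that: re.finditer with group(0), or a non-capturing group (?:...). There is no separate "global" flag in Python; findall and finditer already return every match.

```python
import re

# Hypothetical sample text with two author...date blocks.
text = "author - a\ntitle - t\ndate - d1\nauthor - b\nx\ndate - d2"

pattern = r"^author(.|\s)*?date.*$"

# With a capturing group, findall returns only the last character
# captured by that group for each match, not the matched block.
group_only = re.findall(pattern, text, flags=re.M)

# finditer + group(0) yields the full matched text.
full = [m.group(0) for m in re.finditer(pattern, text, flags=re.M)]

# A non-capturing group lets findall return whole matches directly.
full_nc = re.findall(r"^author(?:.|\s)*?date.*$", text, flags=re.M)

print(group_only)
print(full)
```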