Regex matching everything in between a certain pattern?

For example, I have something like this:

author - xxx
title  - xxxx
abstract - xxxx
date   - xxxx

author - xxx
title  - xxxx
abstract - xxxx,
xxxxxxxx
xxxxxxxx
date   - xxxx

I want to capture everything in between author and date, once for every occurrence of this pattern (no nesting). So the desired output looks like:

["\r\nauthor...date xxx", "\r\nauthor...date xxx" ]

The difficulty I met is that there might be arbitrary lines in between "author" and "date", and I also have to take care of the line breaks.

When I use \r\nauthor.*\n.*\n.*\n.*date.*, it captures every record in which "author" is exactly four lines before "date". But when I try to handle an arbitrary number of lines with \r\nauthor((.|\r\n)*?).*date.*, it returns something weird. Can anyone suggest an expression I can use for this task? Thanks!

CodePudding user response：

You can use re.findall with the flags re.M (multi-line) and re.S (dotall). That way . also matches newlines, and ^/$ match at the beginning and end of each line (regex101):

import re

text = """author - xxx
title  - xxxx
abstract - xxxx
date   - xxxx

author - xxx
title  - xxxx
abstract - xxxx,
xxxxxxxx
xxxxxxxx
date   - xxxx"""

for group in re.findall(r"^author.*?^date.*?$", text, flags=re.M | re.S):
    print(group)
    print("-" * 80)

Prints:

author - xxx
title  - xxxx
abstract - xxxx
date   - xxxx
--------------------------------------------------------------------------------
author - xxx
title  - xxxx
abstract - xxxx,
xxxxxxxx
xxxxxxxx
date   - xxxx
--------------------------------------------------------------------------------
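
If the input really uses \r\n line endings as in the question, note that with re.S the final .*?$ will also pick up the trailing \r, because $ only matches just before \n. Here is a minimal sketch of one way around that. The sample text is made up for illustration, and ending the match with [^\r\n]* instead of .*?$ is my own substitution, not part of the pattern above:

```python
import re

# Hypothetical sample with Windows-style line endings (not from the question).
text = (
    "author - a1\r\ntitle  - t1\r\nabstract - x1\r\ndate   - d1\r\n"
    "\r\n"
    "author - a2\r\ntitle  - t2\r\nabstract - x2,\r\nmore\r\nmore\r\ndate   - d2\r\n"
)

# re.M lets ^ match right after each \n; re.S lets . cross line breaks.
# [^\r\n]* stops before the line ending, so no stray \r ends up in the match.
pattern = re.compile(r"^author.*?^date[^\r\n]*", flags=re.M | re.S)
records = pattern.findall(text)

for record in records:
    print(repr(record))
```

An alternative would be to normalize the text first with text.replace("\r\n", "\n") and keep the original pattern unchanged.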

CodePudding user response：

I revised my old answer because it apparently wasn't the correct answer to your question. My bad, I didn't read it correctly. The following should work for you:

author[\w\W]+?date[^\r\n]+
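
For what it's worth, here is a sketch of how that might look in Python. It reads the pattern with + quantifiers, i.e. author[\w\W]+?date[^\r\n]+, which is an assumption about what was intended. Since [\w\W] matches any character including line breaks, no flags are needed. The text is the sample from the question:

```python
import re

text = """author - xxx
title  - xxxx
abstract - xxxx
date   - xxxx

author - xxx
title  - xxxx
abstract - xxxx,
xxxxxxxx
xxxxxxxx
date   - xxxx"""

# Assumed reading of the pattern: lazily consume anything up to the first
# "date", then take the rest of that line (stopping before \r or \n).
pattern = re.compile(r"author[\w\W]+?date[^\r\n]+")
records = pattern.findall(text)

print(len(records))
for record in records:
    print(record)
    print("-" * 40)
```

Neither author nor date is anchored to the start of a line here, so a word such as "update" inside an abstract would end the match early. Prefixing both with ^ and passing re.M would likely be safer on real data.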

CodePudding user response：

I imagine there must be a cleaner/more efficient regex, but this pattern works with the global and multiline flags:

   ^author(.|\s)*?date.*$
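
One caveat if this is used from Python: when a pattern contains a capturing group, re.findall returns what the group captured rather than the whole match, so the pattern as written hands back single characters. Below is a sketch of two ways around that, assuming the sample text from the question: make the group non-capturing, or use re.finditer and read group(0). Python has no "global" flag, so only re.M is passed.

```python
import re

text = """author - xxx
title  - xxxx
abstract - xxxx
date   - xxxx

author - xxx
title  - xxxx
abstract - xxxx,
xxxxxxxx
xxxxxxxx
date   - xxxx"""

# As written, findall returns only the last thing the group captured.
captured = re.findall(r"^author(.|\s)*?date.*$", text, flags=re.M)

# Option 1: a non-capturing group makes findall return whole matches.
blocks = re.findall(r"^author(?:.|\s)*?date.*$", text, flags=re.M)

# Option 2: keep the original pattern and take the full match from finditer.
blocks_iter = [
    m.group(0)
    for m in re.finditer(r"^author(.|\s)*?date.*$", text, flags=re.M)
]

print(captured)               # group contents, not the records
print(blocks == blocks_iter)  # both options yield the same records
for block in blocks:
    print(block)
    print("-" * 40)
```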