How to find a match across multiple lines?

I have a file that looks like this:

#% text_encoding = utf8

:xy_name1   Text

:xy_name2   Text text text to a text. 

Text and text to text text, text and 

text provides text text text text.

:xy_name3   Text

I want to get each entry (:ENTRY_NAME (tab \t) ENTRY_DESCRIPTION). I'm using the regex r'^([a-zA-Z0-9:_\|!\.\?%\-]+)[\s\t]+(.*)$', but it doesn't work for entries that have multiline descriptions. How can I do that?

CodePudding user response：

The following code will do the trick:

import re

data = '''
#% text_encoding = utf8

:xy_name1   Text

:xy_name2   Text text text to a text. 

Text and text to text text, text and 

text provides text text text text.

:xy_name3   Text
'''

print(re.findall(r'^:(\S+)\s+([\S\s]*?)(?=\n:|\Z)', data, re.M))

The last argument to re.findall is the re.M (re.MULTILINE) flag, which makes ^ match at the start of every line instead of only at the start of the string.

^:(\S+) matches a colon at the beginning of a line, followed by at least one non-whitespace character (the entry name, captured in group 1).

\s+ then consumes the tab and spaces before the description.

([\S\s]*?) matches the description, starting at the first non-whitespace character and including everything that follows, newlines included. A plain . cannot be used here because it does not match the newline character unless the re.DOTALL flag is set (re.M does not change that). That is why I used [\S\s], which matches every non-whitespace and every whitespace character, i.e. any character at all. The ? after the * makes it non-greedy; otherwise the group would consume everything up to the end of the data.

(?=\n:|\Z) marks the end of the description. This group is a positive look-ahead that matches either a newline followed by a colon (\n:) or the end of the data (\Z). A look-ahead does not consume the newline and the colon, so they remain available for the next match found by findall.

The output of the above code is:

[('xy_name1', 'Text\n'), ('xy_name2', 'Text text text to a text. \n\nText and text to text text, text and \n\ntext provides text text text text.\n'), ('xy_name3', 'Text\n')]
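As a variation on the explanation above: adding the re.DOTALL (re.S) flag lets . match newlines, so (.*?) can stand in for ([\S\s]*?). The following is only a minimal sketch, assuming the same sample data with a tab after each entry name; the parse_entries helper and the whitespace stripping are my own additions for illustration, not part of the original answer.

```python
import re

# Sample data mirroring the question's file (assumes a tab after each entry name).
data = (
    "#% text_encoding = utf8\n\n"
    ":xy_name1\tText\n\n"
    ":xy_name2\tText text text to a text. \n\n"
    "Text and text to text text, text and \n\n"
    "text provides text text text text.\n\n"
    ":xy_name3\tText\n"
)

# re.M lets ^ match at the start of every line;
# re.S (DOTALL) lets . match newline characters as well.
ENTRY_RE = re.compile(r'^:(\S+)\s+(.*?)(?=\n:|\Z)', re.M | re.S)


def parse_entries(text):
    """Hypothetical helper: return {name: description} with whitespace stripped."""
    return {name: desc.strip() for name, desc in ENTRY_RE.findall(text)}


entries = parse_entries(data)
for name, desc in entries.items():
    print(name, repr(desc))
```

Stripping the descriptions removes the trailing newlines that the lazy group picks up before each look-ahead match; drop the strip() call if the exact original whitespace matters.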


CodePudding user response：

You can capture the ENTRY_NAME in group 1.

For the ENTRY_DESCRIPTION in group 2 you can match the rest of the line, followed by all lines that do not start with the entry name pattern.

^:([\w:|!.?%()-]+)\t(.*(?:\n(?!:[\w:|!.?%()-]+\t).*)*)


Example

import re

pattern = r"^:([\w:|!.?%()-]+)\t(.*(?:\n(?!:[\w:|!.?%()-]+\t).*)*)"

s = ("#% text_encoding = utf8\n\n"
            ":xy_name1\tText\n\n"
            ":xy_name2\tText text text to a text. \n\n"
            "Text and text to text text, text and \n\n"
            "text provides text text text text.\n\n"
            ":xy_name3\tText")

print(re.findall(pattern, s, re.MULTILINE))

Output

[
  ('xy_name1', 'Text\n'), 
  ('xy_name2', 'Text text text to a text. \n\nText and text to text text, text and \n\ntext provides text text text text.\n'), 
  ('xy_name3', 'Text')
]
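To make the parts of this pattern easier to follow, here is the same idea written with re.VERBOSE and named groups. Treat it as a sketch only: the group names name and desc, the sample string, and the strip() call are my own assumptions for illustration. The sample includes a line that starts with a colon but has no tab after the name, to show that such a line is kept as part of the description.

```python
import re

ENTRY_RE = re.compile(
    r"""
    ^:(?P<name>[\w:|!.?%()-]+)       # entry name after the leading colon
    \t                               # the tab separating name and description
    (?P<desc>
        .*                           # rest of the first line
        (?:
            \n                       # a newline ...
            (?!:[\w:|!.?%()-]+\t)    # ... not followed by the start of a new entry
            .*                       # then the whole continuation line
        )*
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

# Illustrative input (not the question's file): the third line starts with a
# colon but is not an entry, because no tab follows a valid name.
s = (
    "#% text_encoding = utf8\n\n"
    ":xy_name1\tText\n\n"
    ":xy_name2\tFirst line.\n"
    ":not an entry, no tab after the name\n"
    ":xy_name3\tText"
)

results = [(m.group("name"), m.group("desc").strip()) for m in ENTRY_RE.finditer(s)]
for name, desc in results:
    print(name, repr(desc))
```

Because the look-ahead checks for the full entry-name pattern followed by a tab, a description line that merely begins with a colon does not end the entry, which is the main difference from ending the description at any \n: sequence.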