I have a file that looks like this:
#% text_encoding = utf8
:xy_name1 Text
:xy_name2 Text text text to a text.
Text and text to text text, text and
text provides text text text text.
:xy_name3 Text
I want to extract each entry, which has the form :ENTRY_NAME, a tab (\t), then ENTRY_DESCRIPTION.
I'm using the regex r'^([a-zA-Z0-9:_\|!\.\?%\-\(\)]+)[\s\t]+(.*)$',
but it doesn't work with entries that have multi-line descriptions.
How can I do that?
CodePudding user response:
The following code will do the trick:
import re
data = '''
#% text_encoding = utf8
:xy_name1 Text
:xy_name2 Text text text to a text.
Text and text to text text, text and
text provides text text text text.
:xy_name3 Text
'''
print(re.findall(r'^:(\S+)\s+([\S\s]*?)(?=\n:|\Z)', data, re.M))
The last argument to re.findall, re.M, is the multi-line flag: it makes ^ match at the start of every line rather than only at the start of the string.
^:(\S+)
matches a colon at the beginning of a line, followed by at least one non-space character (captured as the entry name)
\s+
then consumes the tab and spaces before the description
([\S\s]*?)
matches the description, beginning with the first non-space character and including everything after it, newlines included. You cannot use . here because . does not match the newline character unless the re.S (DOTALL) flag is set, and the multi-line flag does not change that. That is why I used [\S\s], which matches any non-space character or any space character, i.e. any character at all. The ? at the end makes the * non-greedy; otherwise the group would consume everything up to the end of the data.
(?=\n:|\Z)
marks the end of the description. This group is a positive look-ahead that matches either a newline followed by a colon (\n:) or the end of the data (\Z). A look-ahead does not consume the newline and the colon, so they remain available for the next match found by findall.
The output of the above code is
[('xy_name1', 'Text'), ('xy_name2', 'Text text text to a text.\nText and text to text text, text and\ntext provides text text text text.'), ('xy_name3', 'Text\n')]
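The captured descriptions can still carry trailing whitespace or newlines. Below is a minimal sketch of one way to clean them up and collect the entries in a dict. It is not part of the original answer: the helper name parse_entries and the sample string are assumptions for illustration, and it uses re.S with (.*?) as an equivalent spelling of ([\S\s]*?).

```python
import re

# Same idea as the pattern above; with re.S the dot also matches newlines,
# so (.*?) behaves like ([\S\s]*?).
ENTRY_RE = re.compile(r'^:(\S+)\s+(.*?)(?=\n:|\Z)', re.M | re.S)


def parse_entries(text):
    """Return {name: description} with surrounding whitespace trimmed.

    parse_entries is a hypothetical helper name, not a library function.
    """
    return {name: desc.strip() for name, desc in ENTRY_RE.findall(text)}


# Illustrative sample; name and description are separated by a tab.
sample = (
    "#% text_encoding = utf8\n"
    ":xy_name1\tText\n"
    ":xy_name2\tText text text to a text.\n"
    "Text and text to text text, text and\n"
    "text provides text text text text.\n"
    ":xy_name3\tText\n"
)

entries = parse_entries(sample)
for name, desc in entries.items():
    print(name, repr(desc))
```

Note that a dict keeps only the last description if the same entry name appears twice; if duplicates are possible, keeping the list of tuples from findall may be the safer choice.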
CodePudding user response:
You can capture the ENTRY_NAME in group 1.
For the ENTRY_DESCRIPTION in group 2, you can match the rest of the line, followed by all lines that do not start with the entry name pattern.
^:([\w:|!.?%()-]+)\t(.*(?:\n(?!:[\w:|!.?%()-]+\t).*)*)
Example
import re
pattern = r"^:([\w:|!.?%()-]+)\t(.*(?:\n(?!:[\w:|!.?%()-]+\t).*)*)"
# The entry name and the description are separated by a tab (\t)
s = ("#% text_encoding = utf8\n\n"
":xy_name1\tText\n\n"
":xy_name2\tText text text to a text. \n\n"
"Text and text to text text, text and \n\n"
"text provides text text text text.\n\n"
":xy_name3\tText")
print(re.findall(pattern, s, re.MULTILINE))
Output
[
('xy_name1', 'Text\n'),
('xy_name2', 'Text text text to a text. \n\nText and text to text text, text and \n\ntext provides text text text text.\n'),
('xy_name3', 'Text')
]
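To make the parts of this pattern easier to follow, here is a sketch of the same expression written with re.VERBOSE and named groups. The names NAME, ENTRY_RE, name and description, as well as the sample string, are illustrative assumptions and not part of the answer above.

```python
import re

# Character class for an entry name, reused in the negative look-ahead.
NAME = r"[\w:|!.?%()-]+"

ENTRY_RE = re.compile(
    rf"""
    ^:(?P<name>{NAME})\t      # colon, entry name, then a tab
    (?P<description>
        .*                    # rest of the first line
        (?:
            \n                # a line break ...
            (?!:{NAME}\t)     # ... not followed by the start of a new entry
            .*                # the continuation line
        )*
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

# Illustrative sample with one multi-line description.
s = (":xy_name1\tText\n"
     ":xy_name2\tFirst line\n"
     "second line\n"
     ":xy_name3\tText")

results = [(m.group("name"), m.group("description"))
           for m in ENTRY_RE.finditer(s)]
for name, description in results:
    print(name, repr(description))
```

Because the look-ahead requires a colon, a name and a tab, a continuation line that merely starts with a colon is still treated as part of the description. A look-ahead for \n: alone would stop at any line starting with a colon.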