Given
Word1 content1 content1 content1
content2 content2 content2
content3 content3 content3
Word2
I want to extract content1, content2 and content3 as groups. Could you help me write a regex for that? I tried:
Word1[\s:]*((?P<value>[^\n]+)\n)+Word2
with the gms flags, but it didn't help. I need a regex for the Python re module.
CodePudding user response:
You can use
import re
text = "Word1 content1 content1 content1\n content2 content2 content2\n content3 content3 content3\nWord2"
match = re.search(r'Word1[\s:]*((?:.+\n)*)Word2', text)
if match:
    print([s.strip() for s in match.group(1).splitlines()])
Output:
['content1 content1 content1', 'content2 content2 content2', 'content3 content3 content3']
Details:
Word1 - a Word1 string
[\s:]* - zero or more whitespace characters or colons
((?:.+\n)*) - Group 1: zero or more repetitions of one or more characters other than line break characters, as many as possible, followed by a newline character
Word2 - a Word2 string.
Then, if there is a match, [s.strip() for s in match.group(1).splitlines()] splits the Group 1 value into separate lines and strips the surrounding whitespace from each one.
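For context on why the block is captured as a whole and split afterwards, here is a minimal sketch of what happens when a capturing group sits inside a repeated group in Python's re module. The pattern below is an assumed reconstruction of the attempt in the question (with its quantifiers restored), not necessarily the exact one that was tried.

```python
import re

text = (
    "Word1 content1 content1 content1\n"
    " content2 content2 content2\n"
    " content3 content3 content3\n"
    "Word2"
)

# Assumed form of the question's pattern: a named group inside a repeated group.
repeated = re.search(r'Word1[\s:]*(?:(?P<value>[^\n]+)\n)+Word2', text)

# The pattern does match, but re keeps only the last repetition of the group,
# so the earlier lines are not available through group('value').
last_only = repeated.group('value').strip()
print(last_only)
```

Because re does not keep every repetition of a group, capturing the whole block once and splitting it in Python is the simpler route.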
An alternative solution uses the regex library from PyPI:
import regex
text = "Word1 content1 content1 content1\n content2 content2 content2\n content3 content3 content3\nWord2"
print(regex.findall(r'(?<=Word1[\s:]*(?s:.*?))\S(?:.*\S)?(?=(?s:.*?)\nWord2)', text))
Details:
(?<=Word1[\s:]*(?s:.*?)) - a positive lookbehind that requires a Word1 string, zero or more whitespace characters or colons, and then any zero or more characters, as few as possible, immediately to the left of the current location
\S(?:.*\S)? - a non-whitespace character and then any zero or more characters other than line break characters, as many as possible, up to the last non-whitespace character on the line
(?=(?s:.*?)\nWord2) - a positive lookahead that requires any zero or more characters, as few as possible, and then a newline character and the word Word2 to the right of the current location.
CodePudding user response:
It is simpler to capture everything between the two words in one group and then split that group on newlines.
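import re

# Sample input taken from the question.
text_piece = "Word1 content1 content1 content1\n content2 content2 content2\n content3 content3 content3\nWord2"
first_key = "Word1"
second_key = "Word2"
# Capture everything between the two keys; re.DOTALL lets . match newlines too.
common_regex = r"{first_key}[\s:]*(?P<value>.+){second_key}"
regex = common_regex.format(first_key=first_key, second_key=second_key)
lines = [x.group("value").strip() for x in re.finditer(regex, text_piece, re.DOTALL)]
if lines:
    # Split the first captured block into lines and trim each one.
    lines = [line.strip() for line in lines[0].split("\n")]
else:
    lines = []
print(lines)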
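If the text may contain several Word1 ... Word2 blocks, or keys with regex metacharacters, the same capture-then-split idea can be wrapped in a small helper. This is only a sketch under assumptions: lines_between is a hypothetical name, the sample string is made up for illustration, and a lazy .+? is used here instead of the greedy quantifier so that each match stops at the nearest closing key.

```python
import re

def lines_between(text, first_key, second_key):
    """Return one list of stripped lines per first_key ... second_key block."""
    pattern = re.compile(
        # re.escape guards against metacharacters in the keys;
        # .+? is lazy so a match ends at the nearest second_key.
        re.escape(first_key) + r'[\s:]*(?P<value>.+?)' + re.escape(second_key),
        re.DOTALL,
    )
    return [
        [line.strip() for line in m.group('value').strip().splitlines()]
        for m in pattern.finditer(text)
    ]

# Made-up sample with two blocks.
sample = "Word1 a a\n b b\nWord2\nWord1: c c\nWord2"
result = lines_between(sample, "Word1", "Word2")
print(result)
```

Each inner list holds the lines of one block, so a caller that expects a single block can take the first element, or an empty list when nothing matched.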