How do you find multiple matches of a string between two different tokens with Python regex?

I'm having trouble writing a regular expression that Python supports for this use case.

Imagine you have a text string that is a set of questions and multiple choice answers:

Question 1: What witch-like attributes do you have?
Answer 1:
x Hat
o Pointy Nose
x Float
x Weigh more than a duck

Question 2: Where could this coconut have come from?
Answer 2:
o It migrated
x A European swallow carried it
o An African swallow carried it
x It doesn't matter

... and you would like to extract only the "x" answers to Question 1 from the above text using regex.

If you had access to PCRE you could do something like this using the \G (last match) anchor:

(?:\G(?!^)|Question 1:)(?:(?!Question 1:|Question 2:)[\s\S])*?\K(?:x\s)([a-z]+)(?=(?:(?!Question 1:)[\s\S])*Question 2:)

...or maybe even something fun using subroutines, e.g. (textbetweentokens)(?1)(textwithx).

But Python's built-in re module doesn't support either of those regex features.
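
As a quick check of that claim, the sketch below tries to compile three small patterns that use \G, \K, and a (?1) subroutine call with the standard re module. The test patterns and the helper name is_supported are made up for illustration; they are not part of the question.

```python
import re

# PCRE-style constructs from the question, each wrapped in a tiny sample pattern.
pcre_only = {
    r"\Gx": r"\G anchor",
    r"a\Kb": r"\K match reset",
    r"(a)(?1)": "(?1) subroutine call",
}


def is_supported(pattern):
    """Return True if the standard re module can compile the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


for pattern, label in pcre_only.items():
    print(f"{label}: supported={is_supported(pattern)}")
```

All three report supported=False on current Python 3 releases. The third-party regex package on PyPI documents support for these constructs, but it is not part of the standard library.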

Is there any other way to solve this regex challenge?

Note: There are other questions like this on Stack Overflow, but I couldn't find any whose answers work with Python-supported regex.

CodePudding user response：

You have to split your text into lines to use str.startswith():

texte = """Question 1: What witch-like attributes do you have?
Answer 1:
x Hat
o Pointy Nose
x Float
x Weigh more than a duck

Question 2: Where could this coconut have come from?
Answer 2:
o It migrated
x A European swallow carried it
o An African swallow carried it
x It doesn't matter"""

# Print every line that starts with the "x" marker
for line in texte.splitlines():
    if line.startswith('x'):
        print(line)

Output:

x Hat
x Float
x Weigh more than a duck
x A European swallow carried it
x It doesn't matter
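
Note that this output includes the "x" answers from both questions. A minimal sketch of restricting the same line-by-line approach to Question 1 is shown below, assuming every question starts with a line of the form "Question N:". The function name checked_answers and its question parameter are hypothetical, introduced only for this example.

```python
def checked_answers(text, question="Question 1:"):
    """Collect the 'x' answers listed under one question (hypothetical helper)."""
    answers = []
    in_block = False
    for line in text.splitlines():
        if line.startswith("Question "):
            # A question header either opens or closes the block we care about.
            in_block = line.startswith(question)
        elif in_block and line.startswith("x "):
            # Drop the leading "x " marker and keep the answer text.
            answers.append(line[2:])
    return answers


texte = """Question 1: What witch-like attributes do you have?
Answer 1:
x Hat
o Pointy Nose
x Float
x Weigh more than a duck

Question 2: Where could this coconut have come from?
Answer 2:
o It migrated
x A European swallow carried it
o An African swallow carried it
x It doesn't matter"""

print(checked_answers(texte))
# ['Hat', 'Float', 'Weigh more than a duck']
```

Including the colon in the prefix keeps "Question 1:" from also matching a header such as "Question 10:".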

CodePudding user response：

You could match each line that starts with "x" but include a look-ahead assertion that checks that the next question is question 2:

(?:^x\s)(.*)(?=\s+(?:^(?!Question).*\s+)*^Question 2)

Use the re.M flag so that ^ matches at the start of each line.

This assumes, of course, that the question that precedes question 2 is question 1.

import re

s = """Question 1: What witch-like attributes do you have?
Answer 1:
x Hat
o Pointy Nose
x Float
x Weigh more than a duck

Question 2: Where could this coconut have come from?
Answer 2:
o It migrated
x A European swallow carried it
o An African swallow carried it
x It doesn't matter
"""

answers = re.findall(r"(?:^x\s)(.*)(?=\s+(?:^(?!Question).*\s+)*^Question 2)", s, re.M)
print(answers)

Output:

['Hat', 'Float', 'Weigh more than a duck']
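
If relying on Question 2 following Question 1 is too fragile, one possible alternative is to split the work into two steps: first isolate the text of a single question, then collect the "x" lines inside it. The sketch below is only an illustration under the assumption that every question header looks like "Question N:" at the start of a line; the helper name x_answers is hypothetical.

```python
import re


def x_answers(text, number):
    """Return the 'x' answers under 'Question <number>:' (hypothetical helper)."""
    # Step 1: grab everything from the question header up to the next
    # question header, or to the end of the text if there is none.
    block = re.search(
        rf"^Question {number}:.*?(?=^Question \d+:|\Z)",
        text,
        re.M | re.S,
    )
    if block is None:
        return []
    # Step 2: inside that block, capture the text after each leading "x".
    return re.findall(r"^x[ \t]+(.*)$", block.group(0), re.M)


s = """Question 1: What witch-like attributes do you have?
Answer 1:
x Hat
o Pointy Nose
x Float
x Weigh more than a duck

Question 2: Where could this coconut have come from?
Answer 2:
o It migrated
x A European swallow carried it
o An African swallow carried it
x It doesn't matter
"""

print(x_answers(s, 1))
# ['Hat', 'Float', 'Weigh more than a duck']
print(x_answers(s, 2))
# ['A European swallow carried it', "It doesn't matter"]
```

This version also works for the last question in the text, because the first step falls back to the end of the string when no further question header exists.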