Using Python re to match a string that starts and ends with one of two possible patterns

The | symbol in regular expressions seems to divide the entire pattern, but I need it to divide only a smaller part of the pattern. I want to find a match that starts with either "Q: " or "A: " and ends just before the next "Q: " or "A: ". In between can be anything, including newlines.

My attempt:

import re

string = "Q: This is a question. \nQ: This is a 2nd question \non two lines. \n\nA: This is an answer. \nA: This is a 2nd answer \non two lines.\nQ: Here's another question. \nA: And another answer."

pattern = re.compile(r"(A: |Q: )[\w\W]*(A: |Q: |$)")

matches = pattern.finditer(string)
for match in matches:
    print('-', match.group(0))

The regex I am using is (A: |Q: )[\w\W]*(A: |Q: |$).

Here is the same string over multiple lines, just for reference:

Q: This is a question. 
Q: This is a 2nd question 
on two lines. 

A: This is an answer. 
A: This is a 2nd answer 
on two lines.
Q: Here's another question. 
A: And another answer.

So I was hoping the parentheses would isolate the two possible patterns at the start and the three at the end, but instead the regex seems to treat them as 4 separate patterns. It would also include the next A: or Q: at the end of each match, but hopefully you can see what I was going for. I was planning to just ignore that group.

If it's helpful, this is for a simple study program that grabs the questions and answers from a text file to quiz the user. I was able to make it with the questions and answers being only one line each, but I'm having trouble getting an "A: " or "Q: " that has multiple lines.

CodePudding user response：

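One approach is to use a negative lookahead, (?!...), so that a newline is consumed only when it is not followed by an A: or Q: marker, as follows (with the multiline flag enabled so that ^ matches at the start of each line):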

^([AQ]):(?:.|\n(?![AQ]:))+
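
As a rough sketch, this is how the lookahead pattern might be applied to the sample string from the question. The names text and entries are illustrative only, the pattern is written with a trailing + so that the group repeats, and re.MULTILINE is assumed so that ^ matches at the start of every line:

```python
import re

# Sample text from the question, rebuilt line by line.
text = (
    "Q: This is a question. \n"
    "Q: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \n"
    "A: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \n"
    "A: And another answer."
)

# A newline is consumed only when the next line does not start with "A:" or "Q:".
pattern = re.compile(r"^([AQ]):(?:.|\n(?![AQ]:))+", re.MULTILINE)

# group(1) is the marker letter, group(0) is the whole entry.
entries = [(m.group(1), m.group(0)) for m in pattern.finditer(text)]
for kind, body in entries:
    print(kind, repr(body))
```

Note that with this pattern the second question keeps the newline that precedes the blank line, so stripping each entry may be worthwhile.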


Here's another approach suggested by @Wiktor that should be a little faster:

^[AQ]:.*(?:\n+(?![AQ]:).+)*

A slight modification that uses \n and .* in place of \n+ and .+ (but note that this also captures blank lines at the end):

^[AQ]:.*(?:\n(?![AQ]:).*)*
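
To make the difference between the two variants concrete, here is a minimal comparison on the question's sample string. The first variant is written with \n+ and .+, the second with \n and .*; the names strict and loose are made up for this illustration, and re.MULTILINE is assumed in both cases:

```python
import re

text = (
    "Q: This is a question. \n"
    "Q: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \n"
    "A: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \n"
    "A: And another answer."
)

# Continuation lines must be non-empty, so a trailing blank line is left out.
strict = re.compile(r"^[AQ]:.*(?:\n+(?![AQ]:).+)*", re.MULTILINE)
# Continuation lines may be empty, so a trailing blank line is included.
loose = re.compile(r"^[AQ]:.*(?:\n(?![AQ]:).*)*", re.MULTILINE)

# Neither pattern has capturing groups, so findall returns the full matches.
strict_matches = strict.findall(text)
loose_matches = loose.findall(text)

print(repr(strict_matches[1]))
print(repr(loose_matches[1]))
```

On this input the only difference is the second question, which is followed by a blank line: the loose variant keeps one extra "\n" at its end.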

CodePudding user response：

I suggest just using a for-loop for this, as it's easier, for me at least. To answer your question: why not match up to the period rather than up to the next A: or Q:? Otherwise you'd probably have to use lookaheads.

(A: |Q: )[\s\S]*?\.

[\s\S] matches any character, including newlines. It is the conventional idiom for this, though [\w\W] works as well.

*? is a lazy quantifier. It matches as few characters as it can. If we had just (A: |Q: )[\s\S]*?, it would only match the (A: |Q: ) part, but the trailing \. forces the match to extend to the first period.

\. matches a literal period.
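
A small sketch of the lazy quantifier's effect on the question's sample string, compared with a greedy version of the same pattern. The names lazy and greedy are illustrative only:

```python
import re

text = (
    "Q: This is a question. \n"
    "Q: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \n"
    "A: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \n"
    "A: And another answer."
)

# Lazy: stop at the first period after each marker.
lazy = re.compile(r"(A: |Q: )[\s\S]*?\.")
# Greedy: run on to the last period in the whole text.
greedy = re.compile(r"(A: |Q: )[\s\S]*\.")

lazy_matches = [m.group(0) for m in lazy.finditer(text)]
greedy_matches = [m.group(0) for m in greedy.finditer(text)]

print(len(lazy_matches), len(greedy_matches))
```

The greedy version swallows the whole text in a single match, which is the same reason the [\w\W]* attempt in the question produced one long match. The lazy version only works while each entry contains exactly one period, at its end; an abbreviation such as "e.g." inside a question would cut the match short.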

For the for-loop:

questions_and_answers = []
for line in string.splitlines():
    if line.startswith(("Q: ", "A: ")):
        questions_and_answers.append(line)
    else:
        questions_and_answers[-1] += line

# ['Q: This is a question. ', 'Q: This is a 2nd question on two lines. ', 'A: This is an answer. ', 'A: This is a 2nd answer on two lines.', "Q: Here's another question. ", 'A: And another answer.']