Using python re need to match string that starts and ends with two possible patterns each

Time:10-29

The | symbol in regular expressions seems to divide the entire pattern, but I need it to divide only a smaller part of the pattern. I want to find a match that starts with either "Q: " or "A: " and ends just before the next "Q: " or "A: ". In between can be anything, including newlines.

My attempt:

import re

string = "Q: This is a question. \nQ: This is a 2nd question \non two lines. \n\nA: This is an answer. \nA: This is a 2nd answer \non two lines.\nQ: Here's another question. \nA: And another answer."

# Raw string so that \w and \W reach the regex engine unchanged.
pattern = re.compile(r"(A: |Q: )[\w\W]*(A: |Q: |$)")

matches = pattern.finditer(string)
for match in matches:
    print('-', match.group(0))

The regex I am using is (A: |Q: )[\w\W]*(A: |Q: |$).

Here is the same string over multiple lines, just for reference:

Q: This is a question. 
Q: This is a 2nd question 
on two lines. 

A: This is an answer. 
A: This is a 2nd answer 
on two lines.
Q: Here's another question. 
A: And another answer.

So I was hoping the parentheses would isolate the two possible patterns at the start and the three at the end, but instead it treats it like 4 separate patterns. It would also include the next A: or Q: at the end of each match, but hopefully you can see what I was going for. I was planning to just not use that group or something.

If it's helpful, this is for a simple study program that grabs the questions and answers from a text file to quiz the user. I was able to make it with the questions and answers being only one line each, but I'm having trouble getting an "A: " or "Q: " that has multiple lines.

CodePudding user response:

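One approach could be to use a negative lookahead, (?!...), so that a newline is matched only when it is not followed by the start of the next A: or Q: block. Compile it with re.MULTILINE so that ^ matches at the start of every line:

^([AQ]):(?:.|\n(?![AQ]:))+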


Here's another approach suggested by @Wiktor that should be a little faster:

^[AQ]:.*(?:\n+(?![AQ]:).+)*

A slight modification that uses \n and .* in place of \n+ and .+ (but note that this also captures blank lines at the end):

^[AQ]:.*(?:\n(?![AQ]:).*)*
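
For reference, here is a minimal sketch of running the \n+ / .+ variant on the sample from the question. The variable names text and blocks are mine, not from the answer, and the pattern only works as intended with re.MULTILINE:

```python
import re

# Same sample as in the question, split over several literals for readability.
text = (
    "Q: This is a question. \n"
    "Q: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \n"
    "A: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \n"
    "A: And another answer."
)

# re.MULTILINE makes ^ match at the start of every line,
# not only at the very start of the string.
pattern = re.compile(r"^[AQ]:.*(?:\n+(?![AQ]:).+)*", re.MULTILINE)

# Each match is one complete question or answer, continuation lines included.
blocks = [m.group(0) for m in pattern.finditer(text)]
for block in blocks:
    print("-", repr(block))
```

Because the continuation part requires .+ after the newlines, the blank line after the second question is left out of the match.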

CodePudding user response:

I suggest just using a for-loop for this, as it's easier, for me at least. To answer your question: why not match up to the next period rather than up to the next A: or Q:? Otherwise you'd probably have to use lookaheads.

(A: |Q: )[\s\S]*?\.

[\s\S] is conventionally used to match any character, including newlines, though [\w\W] works as well.

*? is a lazy quantifier. It matches as few characters as it can. If we had just (A: |Q: )[\s\S]*?, then it'd only match the (A: |Q: ), but we have the ending \..

\. matches a literal period.
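
To see the trade-off, here is a rough sketch comparing the period-terminated pattern with a lookahead variant. The lookahead pattern, the sample text, and the variable names are my own assumptions for illustration, not part of the answer above:

```python
import re

# Hypothetical sample: the first question contains periods before its end.
text = "Q: What is 2.5 + 2.5? \nA: It is 5. \nQ: Spans \ntwo lines. \nA: Done."

# Pattern from the answer: lazily consume up to the first literal period.
up_to_period = re.compile(r"(?:A: |Q: )[\s\S]*?\.")

# Assumed variant: lazily consume until just before a newline that is
# followed by the next marker, or until the end of the string.
up_to_marker = re.compile(r"(?:A: |Q: )[\s\S]*?(?=\n(?:A: |Q: )|\Z)")

period_matches = up_to_period.findall(text)
marker_matches = up_to_marker.findall(text)

print(period_matches)
print(marker_matches)
```

The period version cuts the first question short at "2.", while the lookahead version keeps each item whole, at the cost of a slightly longer pattern.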

For the for-loop:

questions_and_answers = []
for line in string.splitlines():
    if line.startswith(("Q: ", "A: ")):
        # A new question or answer begins on this line.
        questions_and_answers.append(line)
    else:
        # Continuation (or blank) line: append it to the previous entry.
        questions_and_answers[-1] += line

# ['Q: This is a question. ', 'Q: This is a 2nd question on two lines. ', 'A: This is an answer. ', 'A: This is a 2nd answer on two lines.', "Q: Here's another question. ", 'A: And another answer.']
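
One caveat with the loop: if the text starts with a line that has no "Q: " or "A: " prefix, indexing [-1] on the empty list raises an IndexError. Below is a possible guarded variant. The function name split_blocks is hypothetical, and it assumes you want to keep the line breaks inside an entry and ignore anything before the first marker:

```python
def split_blocks(text):
    """Group lines into entries that each start with 'Q: ' or 'A: '."""
    blocks = []
    for line in text.splitlines():
        if line.startswith(("Q: ", "A: ")):
            # A new question or answer begins here.
            blocks.append(line)
        elif blocks:
            # Continuation line: restore the newline that splitlines() removed.
            blocks[-1] += "\n" + line
        # Lines before the first marker are skipped (assumed to be noise).
    # Drop trailing whitespace and blank lines from each entry.
    return [block.rstrip() for block in blocks]


entries = split_blocks("intro line\nQ: a \nb\n\nA: c")
print(entries)
```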