The | symbol in regular expressions seems to divide the entire pattern, but I need it to divide only a smaller part of the pattern. I want to find a match that starts with either "Q: " or "A: " and ends just before the next "Q: " or "A: ". In between can be anything, including newlines.
My attempt:
import re

string = "Q: This is a question. \nQ: This is a 2nd question \non two lines. \n\nA: This is an answer. \nA: This is a 2nd answer \non two lines.\nQ: Here's another question. \nA: And another answer."
pattern = re.compile(r"(A: |Q: )[\w\W]*(A: |Q: |$)")
matches = pattern.finditer(string)
for match in matches:
    print('-', match.group(0))
The regex I am using is (A: |Q: )[\w\W]*(A: |Q: |$).
Here is the same string over multiple lines, just for reference:
Q: This is a question.
Q: This is a 2nd question
on two lines.
A: This is an answer.
A: This is a 2nd answer
on two lines.
Q: Here's another question.
A: And another answer.
So I was hoping the parentheses would isolate the two possible patterns at the start and the three at the end, but instead it treats them like 4 separate patterns. It would also include the next A: or Q: at the end of each match, but hopefully you can see what I was going for. I was planning to just not use that group or something.
If it's helpful, this is for a simple study program that grabs the questions and answers from a text file to quiz the user. I was able to make it with the questions and answers being only one line each, but I'm having trouble getting an "A: " or "Q: " that has multiple lines.
CodePudding user response:
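One approach is to use a negative lookahead, (?!...), so that a newline is matched only when it is not followed by an A: or Q: block, as follows (with the multiline flag enabled so that ^ matches at the start of each line):
^([AQ]):(?:.|\n(?![AQ]:))+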
Here's another approach, suggested by @Wiktor, that should be a little faster:
^[AQ]:.*(?:\n+(?![AQ]:).+)*
A slight modification matches .* instead of .+ and \n instead of \n+ (but note that this also captures blank lines at the end):
^[AQ]:.*(?:\n(?![AQ]:).*)*
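As a rough sketch of how the last pattern above might be used from Python: it assumes the re.MULTILINE flag (so that ^ anchors at the start of every line) and reuses the sample string from the question. The names text, raw, and entries are just illustrative. Because this variant can capture a trailing blank line, the sketch strips each match afterwards.

```python
import re

# Sample text from the question.
text = (
    "Q: This is a question. \n"
    "Q: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \n"
    "A: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \n"
    "A: And another answer."
)

# re.MULTILINE makes ^ match at the start of each line, not only at the
# start of the whole string. The negative lookahead (?![AQ]:) stops the
# match at any newline that is followed by a new "A:" or "Q:" marker.
pattern = re.compile(r"^[AQ]:.*(?:\n(?![AQ]:).*)*", re.MULTILINE)

raw = [m.group(0) for m in pattern.finditer(text)]

# This variant may include a trailing blank line, so trim the whitespace.
entries = [r.strip() for r in raw]

for entry in entries:
    print("-", repr(entry))
```

Each element of entries is then one complete question or answer, with any internal line breaks preserved.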
CodePudding user response:
I suggest just using a for-loop for this, as it's easier, for me at least. To answer your question, why not match up to the period rather than up to the next A: or Q:? Otherwise you'd probably have to use lookaheads.
(A: |Q: )[\s\S]*?\.
[\s\S] matches any character, including newlines (it is the conventional choice, though [\w\W] works as well).
*? is a lazy quantifier: it matches as few characters as it can. If we had just (A: |Q: )[\s\S]*?, it would only match the (A: |Q: ) part, but here the trailing \. forces the match to extend to the next period.
\. matches a literal period.
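To make the lazy versus greedy difference concrete, here is a small sketch. The short text string is made up for illustration, and the names lazy, greedy, and cut_short are hypothetical. It also shows the main caveat of stopping at a period: an entry that contains a period in the middle gets cut short.

```python
import re

# Made-up sample: three entries, one of them spanning two lines.
text = "Q: One. \nQ: Two \non two lines. \nA: Done."

lazy = re.compile(r"(A: |Q: )[\s\S]*?\.")   # stops at the first period
greedy = re.compile(r"(A: |Q: )[\s\S]*\.")  # runs to the last period

lazy_matches = [m.group(0) for m in lazy.finditer(text)]
greedy_matches = [m.group(0) for m in greedy.finditer(text)]

print(lazy_matches)    # one match per entry
print(greedy_matches)  # a single match covering the whole string

# Caveat: a period inside an entry ends the lazy match early.
cut_short = lazy.search("Q: What is 3.14?").group(0)
print(repr(cut_short))
```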
For the for-loop:
questions_and_answers = []
for line in string.splitlines():
    if line.startswith(("Q: ", "A: ")):
        # A new question or answer begins on this line.
        questions_and_answers.append(line)
    else:
        # Continuation line: append it to the current entry.
        questions_and_answers[-1] += line
# ['Q: This is a question. ', 'Q: This is a 2nd question on two lines. ', 'A: This is an answer. ', 'A: This is a 2nd answer on two lines.', "Q: Here's another question. ", 'A: And another answer.']