I have a multiline string of the following form:
Front
(A) Text1.
(A) Text2.
(A) Text3.
(A) Text4.
(A) Text5.
End
Note that Text1, Text2, etc. may contain line breaks. I wish to append the string END after each of Text1, Text2, etc.
Let c denote the multiline string above. I tried to use re.sub to perform this:
c = re.sub(r"\(A\)(.*?)\n\n\(A\)", r"(A)\1 END\n\n(A)", c, flags=re.DOTALL)
However, this only replaces every odd-numbered point. Here is the output:
Front
(A) Text1. END
(A) Text2.
(A) Text3. END
(A) Text4.
(A) Text5.
End
The last bullet point can be handled as a special case. I'm more concerned that only every other bullet point has END appended. I believe this is because once the second (A) has been consumed as the end of one match, Python will not use it as the start of the next match.
How can I resolve this?
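To make the suspicion above concrete, here is a minimal sketch that lists what the original pattern actually matches. It assumes the items are separated by exactly one blank line, as the "\n\n" in the pattern implies; the sample string is made up for illustration.

```python
import re

# Assumed input: items separated by one blank line, as the "\n\n" in the pattern implies.
c = "Front\n\n(A) Text1.\n\n(A) Text2.\n\n(A) Text3.\n\n(A) Text4.\n\n(A) Text5.\n\nEnd"

pattern = re.compile(r"\(A\)(.*?)\n\n\(A\)", flags=re.DOTALL)

# re.sub works on non-overlapping matches. Each match below swallows the "(A)"
# that opens the following item, so that item cannot start a match of its own.
matched_text = [m.group(0) for m in pattern.finditer(c)]
for text in matched_text:
    print(repr(text))
```

Only two matches are found, one starting at Text1 and one starting at Text3, which lines up with the every-other-item output shown above.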
CodePudding user response:
Python's regular expressions support lookahead, which is good for your use case:
c = re.sub(r"\(A\)(.*?)\n\n(?=\(A\))", r"(A)\1 END\n\n", c, flags=re.DOTALL)
A lookahead, written (?=...), checks that the enclosed pattern matches at that position but does not include it in the matched span (it is a zero-width assertion), so the next (A) remains available to start the following match.
Sample:
import re
c = """Front

(A) Text1.
Foo.
Bar.

(A) Text2.
Some extra text and a fake bullet (A)
More text

(A) Text3.

(A) Text4.

(A) Text5.

End"""
# The lookahead keeps the next "(A)" out of the match, so it can start the next one.
c = re.sub(r"\(A\)(.*?)\n\n(?=\(A\))", r"(A)\1 END\n\n", c, flags=re.DOTALL)
print(c)
prints
Front

(A) Text1.
Foo.
Bar. END

(A) Text2.
Some extra text and a fake bullet (A)
More text END

(A) Text3. END

(A) Text4. END

(A) Text5.

End
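To see the zero-width behaviour in isolation, the following sketch compares the span of the consuming pattern with the span of the lookahead version. The two-item string s is a made-up input, not taken from the question.

```python
import re

s = "(A) one\n\n(A) two"  # hypothetical two-item input

# The consuming version includes the second "(A)" in the match.
consuming = re.search(r"\(A\)(.*?)\n\n\(A\)", s, flags=re.DOTALL)
# The lookahead version only checks that "(A)" follows; it does not consume it.
lookahead = re.search(r"\(A\)(.*?)\n\n(?=\(A\))", s, flags=re.DOTALL)

print(consuming.span())          # span covers the second "(A)"
print(lookahead.span())          # span stops just before the second "(A)"
print(repr(s[lookahead.end():])) # the text left over for the next match
```

With the lookahead, the match ends right before the second (A), so the next search can begin there.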
CodePudding user response:
I used the following regex to select lines starting with (A):
r"\(A\).*"
I then used a custom replacement function that returns the original line with " END" appended.
Here is the code:
import re
c = """Front
(A) Text1.
(A) Text2.
(A) Text3.
(A) Text4.
(A) Text5.
End"""
def rep(m):
    # Return the matched line with " END" appended.
    return m.group(0) + " END"
c = re.sub(r"\(A\).*", repl=rep, string=c)
print(c)
Output:
Front
(A) Text1. END
(A) Text2. END
(A) Text3. END
(A) Text4. END
(A) Text5. END
End
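One caveat with the approach above: without re.DOTALL, the dot stops at the first newline, so an item that spans several lines would get END after its first line only. A possible variant is sketched below. It assumes items are separated by a blank line and never contain a blank line themselves; the helper name append_end and the sample text are made up for illustration.

```python
import re

# Assumed input: each item ends at the next blank line (or at the end of the string).
c = """Front

(A) Text1.
Foo.

(A) Text2.

End"""

def append_end(m):
    # Return the whole matched item with " END" appended.
    return m.group(0) + " END"

# ^ with re.MULTILINE anchors "(A)" at the start of a line; re.DOTALL lets the
# lazy .*? run across line breaks until the lookahead sees a blank line or \Z.
result = re.sub(r"^\(A\).*?(?=\n\n|\Z)", append_end, c, flags=re.DOTALL | re.MULTILINE)
print(result)
```

Because the pattern stops at the first blank line, it would cut an item short if the item itself contained one, so this is only a sketch under that assumption.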
CodePudding user response:
You can modify your regex pattern to use lookahead and lookbehind, which are zero-width (i.e. they do not consume characters), to get around the issue you describe:
"I believe this is because when the second (A) is used as the endpoint of re.sub"
c = re.sub(r"(?<=\(A\))(.*?)(?=\n\n\(A\)|\n\nEnd)", r"\1 END", c, flags=re.DOTALL)
print(c)
Output
Front
(A) Text1. END
(A) Text2. END
(A) Text3. END
(A) Text4. END
(A) Text5. END
End
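The snippet above does not show its input, so here is a self-contained sketch of the same pattern. The sample string is an assumption: items are separated by one blank line and the text closes with a literal End line, which is what the second lookahead alternative relies on.

```python
import re

# Assumed input: blank-line separated items, closed by a literal "End" line.
c = """Front

(A) Text1.

(A) Text2.
Second line.

(A) Text3.

End"""

# The lookbehind requires "(A)" just before the match; the lookahead requires a
# blank line followed by either the next "(A)" or the closing "End".
pattern = r"(?<=\(A\))(.*?)(?=\n\n\(A\)|\n\nEnd)"
result = re.sub(pattern, r"\1 END", c, flags=re.DOTALL)
print(result)
```

Since neither (A) nor the terminator is consumed, the replacement string only needs the captured text plus " END", and the last item is handled as well. If the real text ends with something other than End, the second alternative would need to change accordingly.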