I have this line with two HTML tags:
<p ><!-- ARTICOL FINAL -->
I'm using Python with the regex below to delete the first tag, so that only <!-- ARTICOL FINAL --> remains.
This is the regex code:
if len(re.findall('(<p >)(<\!-- ARTICOL FINAL -->)', page_html, flags=re.IGNORECASE)) != 0:
    page_html = re.sub('(<p >)(<\!-- ARTICOL FINAL -->)', '\2', page_html, flags=re.IGNORECASE)
    counter_img = 1
The replacement seems to have been made, but instead of the second tag I get an STX character (I believe it is an ANSI or UTF-8 character).
CodePudding user response:
The string '\2' is the ASCII character Ctrl-B, also known as STX. To get a literal backslash in the substitution, use a raw string, r'\2', or double the backslash.
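A quick way to see the difference, as a minimal sketch. The sample string in page_html is made up for illustration and the variable names are only placeholders:

```python
import re

# Hypothetical sample input, not taken from the original page.
page_html = '<p ><!-- ARTICOL FINAL -->'
pattern = r'(<p >)(<!-- ARTICOL FINAL -->)'

# '\2' is an octal escape: a one-character string holding chr(2), i.e. STX.
plain = '\2'
# r'\2' and '\\2' are both two characters: a backslash followed by '2'.
raw = r'\2'
doubled = '\\2'

broken = re.sub(pattern, plain, page_html)  # inserts the STX character
fixed = re.sub(pattern, raw, page_html)     # inserts the text of group 2

print(repr(plain), repr(raw), raw == doubled)
print(repr(broken))
print(repr(fixed))
```

With the plain string, re.sub never sees a backreference at all, only the control character, which is why the output contains STX.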
There is no need to run findall separately; re.sub will simply do nothing if there are no matches. If you want to find out whether any substitutions took place, consider re.subn, which also returns the number of substitutions made:
page_html, count = re.subn('(<p >)(<!-- ARTICOL FINAL -->)', r'\2', page_html, flags=re.IGNORECASE)
if count:
    counter_img = 1
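Here is the same idea as a small self-contained sketch. The helper name strip_leading_p and the sample inputs are hypothetical, chosen only to show what re.subn returns when there is a match and when there is none:

```python
import re

# Compile once; IGNORECASE mirrors the flag used in the question.
PATTERN = re.compile(r'(<p >)(<!-- ARTICOL FINAL -->)', flags=re.IGNORECASE)


def strip_leading_p(page_html):
    """Return (new_html, count), where count is the number of replacements."""
    return PATTERN.subn(r'\2', page_html)


# A matching input: the <p > tag is dropped and count is 1.
new_html, count = strip_leading_p('<P ><!-- articol final -->')

# A non-matching input: the text comes back unchanged and count is 0.
unchanged, zero = strip_leading_p('<div>nothing to do</div>')

print(repr(new_html), count)
print(repr(unchanged), zero)
```

Because the count is 0 when nothing matched, it can be used directly in an if statement in place of the separate findall call.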
Tangentially, notice also that ! is not a regex metacharacter, and thus does not need to be escaped with a backslash. (As you were not using a raw string for the regex, either, '\!' is also an unrecognized escape sequence in the Python string literal. Python currently leaves the backslash in place, so the pattern still works, but this has produced a DeprecationWarning since Python 3.6 and a SyntaxWarning since Python 3.12, and the documentation says it will eventually become a SyntaxError.)
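To check that the backslash before ! changes nothing on the regex side, here is a small sketch. The sample string is made up, and both patterns are raw strings so that Python itself does not complain about the escape:

```python
import re

# Hypothetical sample input.
html = '<p ><!-- ARTICOL FINAL -->'

# Same pattern with and without a backslash before the '!'.
unescaped = re.search(r'<!-- ARTICOL FINAL -->', html)
escaped = re.search(r'<\!-- ARTICOL FINAL -->', html)

# Both find the comment at the same position.
same_span = unescaped.span() == escaped.span()

# On Python 3.7 and later, re.escape leaves '!' alone because it has no
# special meaning in a regular expression.
escape_of_bang = re.escape('!')

print(unescaped.group(0), same_span, repr(escape_of_bang))
```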
CodePudding user response:
Solution by: @Ramesh
re.sub('(<p >)(<!-- ARTICOL FINAL -->)', '<!-- ARTICOL FINAL -->', page_html)
So the final find-and-replace should be:
if len(re.findall('(<p >)(<!-- ARTICOL FINAL -->)', page_html, flags=re.IGNORECASE)) != 0:
    page_html = re.sub('(<p >)(<!-- ARTICOL FINAL -->)', '<!-- ARTICOL FINAL -->', page_html, flags=re.IGNORECASE)
    counter_img = 1
Or, a second solution (if you want to use r''):
if re.search(r'(<p >)(<!-- ARTICOL FINAL -->)', page_html, flags=re.IGNORECASE):
    page_html = re.sub(r'(<p >)(<!-- ARTICOL FINAL -->)', r'\2', page_html, flags=re.IGNORECASE)
    counter_img = 1
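A note on guarding with re.search: it returns a match object or None, never a number, so comparing the result with 0 is true even when nothing matched. Testing the result for truthiness is the safer check. A minimal sketch with made-up sample strings:

```python
import re

pattern = r'(<p >)(<!-- ARTICOL FINAL -->)'

# Hypothetical inputs: one that matches and one that does not.
hit = re.search(pattern, '<p ><!-- ARTICOL FINAL -->')
miss = re.search(pattern, '<div>no match here</div>')

# None != 0 is True, so a "!= 0" check passes even when nothing matched.
always_true = (miss != 0)

# Truthiness distinguishes the two cases correctly.
hit_found = bool(hit)
miss_found = bool(miss)

print(always_true, hit_found, miss_found)
```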