Python: After find and replace with re.sub, I get STX on that line

Time:09-26

I have this line with 2 html tags:

<p ><!-- ARTICOL FINAL -->

I use Python and this regex to delete the first tag, so that only <!-- ARTICOL FINAL --> remains.

This is the regex code:

if len(re.findall('(<p >)(<\!-- ARTICOL FINAL -->)', page_html, flags=re.IGNORECASE)) != 0:
    page_html = re.sub('(<p >)(<\!-- ARTICOL FINAL -->)', '\2', page_html, flags=re.IGNORECASE)
    counter_img  = 1

It seems that the replacement was made, but instead of the second tag, I get this:

STX (I believe it is an ANSI or UTF-8 character).

CodePudding user response:

The string '\2' is the ASCII character ctrl-B, also known as STX. To get a literal backslash in the substitution, use a raw string r'\2', or double the backslash.
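
A quick way to see the difference, as a minimal sketch (the sample page_html value is made up for illustration and is not from the original page):

```python
import re

# Hypothetical sample input, modelled on the line from the question.
page_html = '<p ><!-- ARTICOL FINAL -->'
pattern = r'(<p >)(<!-- ARTICOL FINAL -->)'

# '\2' is an octal string escape: a single character, chr(2), i.e. STX.
assert '\2' == '\x02'
# r'\2' keeps the backslash: two characters, which re.sub reads as "group 2".
assert r'\2' == '\\2'

broken = re.sub(pattern, '\2', page_html)   # inserts the STX character
fixed = re.sub(pattern, r'\2', page_html)   # inserts the text of group 2

print(repr(broken))  # '\x02'
print(repr(fixed))   # '<!-- ARTICOL FINAL -->'
```

The pattern is written as a raw string as well, which is a good habit for any regex that contains backslashes.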

There is no need to run findall separately; re.sub will simply do nothing if there are no matches. If you want to find out whether any substitutions took place, maybe turn to re.subn:

page_html, count = re.subn('(<p >)(<!-- ARTICOL FINAL -->)', r'\2', page_html, flags=re.IGNORECASE)
if count:
    counter_img  = 1
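
To illustrate the no-match case, here is a small sketch; the helper name strip_opening_p and both sample strings are assumptions made up for this example:

```python
import re

PATTERN = r'(<p >)(<!-- ARTICOL FINAL -->)'

def strip_opening_p(page_html):
    """Hypothetical helper: drop the '<p >' in front of the marker comment.

    Returns the (possibly unchanged) text and the number of substitutions.
    """
    return re.subn(PATTERN, r'\2', page_html, flags=re.IGNORECASE)

# A line that matches (case-insensitively) is rewritten and counted.
hit, hit_count = strip_opening_p('<p ><!-- articol final -->')
# A line without the pattern comes back untouched, with a count of 0.
miss, miss_count = strip_opening_p('<div>nothing to do</div>')

print(hit, hit_count)
print(miss, miss_count)
```

Because the count is 0 when nothing matched, the separate findall check is not needed to decide whether to set counter_img.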

Tangentially, notice also that ! is not a regex metacharacter, and thus does not need to be escaped with a backslash. (As you were not using a raw string for the regex either, '\!' is an unrecognized string escape. Python currently keeps the backslash in the string, so the pattern still works, but it has emitted a DeprecationWarning for this since 3.6 and a SyntaxWarning since 3.12, and it is expected to become an error in a future version. If you really want a backslash in the pattern, use a raw string or double it.)
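
As a rough check of the point about !, the sketch below matches the comment with and without the escape; the sample text is an assumption for illustration:

```python
import re

# Hypothetical sample text containing the marker comment.
text = '<!-- ARTICOL FINAL -->'

# '!' has no special meaning in a regex, so the unescaped pattern matches.
plain = re.search(r'<!-- ARTICOL FINAL -->', text)

# Escaping it is harmless to the regex engine, but the backslash should be
# written in a raw string or doubled so Python does not see an invalid escape.
escaped_raw = re.search(r'<\!-- ARTICOL FINAL -->', text)
escaped_doubled = re.search('<\\!-- ARTICOL FINAL -->', text)

print(plain is not None, escaped_raw is not None, escaped_doubled is not None)
```

All three forms find the same match, so the simplest spelling without the backslash is the one to prefer.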

CodePudding user response:

Solution by: @Ramesh

page_html = re.sub('(<p >)(<!-- ARTICOL FINAL -->)', '<!-- ARTICOL FINAL -->', page_html)

So the final Find and Replace should be:

if len(re.findall('(<p >)(<!-- ARTICOL FINAL -->)', page_html, flags=re.IGNORECASE)) != 0:
    # Write the replacement text out literally, so no backreference is needed
    page_html = re.sub('(<p >)(<!-- ARTICOL FINAL -->)', '<!-- ARTICOL FINAL -->', page_html, flags=re.IGNORECASE)
    counter_img = 1

Or, a second solution (if you want to use a raw string, r''):

# re.search returns a match object or None, so test its truthiness rather than comparing with 0
if re.search(r'(<p >)(<!-- ARTICOL FINAL -->)', page_html, flags=re.MULTILINE):
    page_html = re.sub(r'(<p >)(<!-- ARTICOL FINAL -->)', r'\2', page_html, flags=re.MULTILINE)
    counter_img = 1