Python: Replace string parts that DO NOT match specific regex

Time:10-27

I need to URL-encode the parts of a string that do not match a regex. My current solution (below) is to:

  1. match the parts to preserve with a regex (##.*?##)
  2. store the matched substrings in a list and replace each with a placeholder index that encoding leaves untouched, e.g. ~~1~~
  3. encode everything (the entire URL)
  4. put the stored substrings back

This code works, but I'm sure it could be done better, in a single pass that looks for the parts of the string not matching my regex. Doing all of this every time adds a huge overhead.

import re
from itertools import count
import urllib.parse

def replace_parts(url):
    parts = []
    counter = count(0)
    def replace_to(match):
        # Stash the protected ##...## span and leave a numbered placeholder
        match = match.group(0)
        parts.append(match)
        return '~~' + str(next(counter)) + '~~'

    def replace_from(match):
        # Placeholders come back in the order they were created
        return parts[next(counter)]

    url = re.sub(r'##(.*?)##', replace_to, url)
    url = urllib.parse.quote(url)

    counter = count(0)
    url = re.sub(r'~~([0-9]+)~~', replace_from, url)
    return url

url1 = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
url = replace_parts(url1)
print(url)
# this becomes http%3A//google.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr

CodePudding user response:

You could use re.sub to match the ##.*?## pattern, but also the text that preceded it, so that you have both categories of text as a pair. Then apply the URL encoding only on the first part in the callback function. To deal with the ending of the input, allow the second part to be either the ##.*?## pattern or the end of the input ($):

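import re
import urllib.parse

def replace_parts(url):
    # m[1] is the text to encode, m[2] the ##...## span (or '' at the end of the input)
    return re.sub(r'(.*?)(##.*?##|$)',
                  lambda m: urllib.parse.quote(m[1]) + m[2],
                  url)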
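
To make the pairing visible, here is a minimal sketch that lists the (text to encode, protected span) pairs for the URL from the question and then joins them back together. The names PAIR, sample, pairs and encoded are only illustrative, and the join reproduces the re.sub result only under the assumption that the input has no newlines (`.` does not match them).

```python
import re
import urllib.parse

# Same pattern as above: lazy prefix, then a ##...## span or the end of input.
PAIR = re.compile(r'(.*?)(##.*?##|$)')

sample = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"

# Each match yields a (text to encode, protected span or '') pair.
pairs = [(m[1], m[2]) for m in PAIR.finditer(sample)]
for before, protected in pairs:
    print(repr(before), repr(protected))

# Encode only the first element of each pair and keep the second as-is.
encoded = ''.join(urllib.parse.quote(before) + protected
                  for before, protected in pairs)
print(encoded)
```

Note that urllib.parse.quote keeps only `/` unescaped by default (`safe='/'`), so `:`, `?`, `&` and `=` outside the ##...## spans are percent-encoded.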

CodePudding user response:

Another option is re.sub with a lambda, using a pattern that has an alternation and a capture group.

In the lambda, check whether capture group 1 matched. If it did, apply urllib.parse.quote to it and return the result. If there is no group 1, return the whole match unchanged.

See a regex demo for the matches and groups.

The pattern matches

  • ##\S*?## Match as few non whitespace chars as possible between ##
  • | Or
  • ((?:(?!##.*?##)\S)+) Capture in group 1 one or more non-whitespace chars, stopping where a ##...## span begins
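
As a rough stand-in for the regex demo, the following sketch prints each match of this pattern on the URL from the question and says whether group 1 took part (text to encode) or not (a ##...## span to keep). The names sample and rows are only illustrative.

```python
import re

pattern = re.compile(r"##\S*?##|((?:(?!##.*?##)\S)+)")
sample = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"

# Group 1 is set only when the second alternative matched.
rows = [("encode" if m[1] is not None else "keep", m[0])
        for m in pattern.finditer(sample)]

for kind, text in rows:
    print(f"{kind:6} {text!r}")
# encode 'http://google.com?this_is_my_encodedurl'
# keep   '##somethin##'
# encode '&email='
# keep   '##other##'
# encode 'tr'
```

Because both alternatives use \S, whitespace is matched by neither of them, so re.sub copies it through unchanged and unencoded.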

Example

import re
import urllib.parse

pattern = r"##\S*?##|((?:(?!##.*?##)\S)+)"

def replace_parts(url):
    return re.sub(
        pattern,
        lambda m: urllib.parse.quote(m[1]) if m[1] else m[0],
        url
    )


s = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
print(replace_parts(s))

Output

http%3A//google.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr