I need to url encode parts of the string that do not match a regex. Current solution (below) is:
- to select what regex I match (##.*##)
- put found substrings in a list and replace them with some not encodable indexes ~~1~~
- encode everything (entire url)
- put back the elements I found
I have this code that works. But I'm sure it could be done better, with a single parse looking for parts of the strings not matching my regex. It adds a huge overhead doing this everytime.
import re
from itertools import count
import urllib.parse
def replace_parts(url):
parts = []
counter = count(0)
def replace_to(match):
match = match.group(0)
parts.append(match)
return '~~' str(next(counter)) '~~'
def replace_from(match):
return parts[next(counter)]
url = re.sub(r'##(.*?)##', replace_to, url)
url = urllib.parse.quote(url)
counter = count(0)
url = re.sub(r'~~([0-9] )~~', replace_from, url)
print (url)
url1 = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
url = replace_parts(url1)
# this becomes http://google.com?this_is_my_encodedurl##somethin##
&email=##other##tr
CodePudding user response:
You could use re.sub
to match the ##.*?##
pattern, but also the text that preceded it, so that you have both categories of text as a pair. Then apply the URL encoding only on the first part in the callback function. To deal with the ending of the input, allow the second part to be either the ##.*?##
pattern or the end of the input ($
):
def replace_parts(url):
return re.sub(r'(.*?)(##.*?##|$)',
lambda m: urllib.parse.quote(m[1]) m[2],
url)
CodePudding user response:
Another option using a re.sub with a lambda using a capture group and a match with an alternation.
In the lambda check if capture group 1 exists. If it does, apply urllib.parse.quot
and then return it. If there is no group 1, then return the match.
See a regex demo for the matches and groups.
The pattern matches
##\S*?##
Match as few non whitespace chars as possible between ##|
Or((?:(?!##.*?##)\S) )
Capture in group 1 a sequence of chars that are not directly followed by##...##
Example
import re
import urllib.parse
pattern = r"##\S*?##|((?:(?!##.*?##)\S) )"
def replace_parts(url):
return re.sub(
pattern,
lambda m: urllib.parse.quote(m[1]) if m[1] else m[0],
url
)
s = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
print(replace_parts(s))
Output
http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr