I have a problem with URL encodings. For instance, I am looking at the following URL:
https://fr.wikisource.org/wiki/Poésies_(Mallarmé,_1914,_8e_éd.)
When I copy and paste this URL, I actually get
https://fr.wikisource.org/wiki/Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.)
And when I use urllib.parse.quote()
I actually get
https://fr.wikisource.org/wiki/Po%C3%A9sies_%28Mallarm%C3%A9%2C_1914%2C_8e_%C3%A9d.%29
So, as you can see, this last version also encodes the parentheses and possibly other symbols, while the previous one only encodes the non-ASCII (accented) characters.
Now I am screening tags and need to match
Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.)/Salut
which uses the second type of encoding.
The program's input is of the form Poésies_(Mallarmé,_1914,_8e_éd.)
, which is how the URL looks in the browser's search bar.
How do I convert this to Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.)
, which is what I want to match?
Is there a way to check for all possible encodings of the input expression when doing the screening?
EDIT
I worked around it with
url_title = url_title.replace("%28","(").replace("%29",")").replace("%2C",",")
but of course that's not clean, as other input strings might contain other characters that get encoded when they should not be.
Answer:
quote() and quote_plus() have a safe option that defines which characters are left unencoded.
import urllib.parse
url = 'https://fr.wikisource.org/wiki/Poésies_(Mallarmé,_1914,_8e_éd.)'
# keep ':', '/', '(', ')' and ',' as they are; everything else unsafe is percent-encoded
print(urllib.parse.quote(url, safe=':/(),'))
Result:
https://fr.wikisource.org/wiki/Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.)
But it will still quote all other special characters, so you may need to add more of them to safe.
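If the input may arrive either decoded or already percent-encoded, one option is to decode it first and then encode it again with the chosen safe set. The following is only a minimal sketch: the helper name requote and its default safe set are assumptions made for illustration, not part of urllib.

```python
from urllib.parse import quote, unquote

# Hypothetical helper: the name and the default safe set are assumptions.
def requote(title, safe="/:(),"):
    # Decode first so that already-encoded input is not encoded twice
    # (otherwise every '%' would itself become '%25').
    return quote(unquote(title), safe=safe)

# Both the decoded form and the fully quoted form end up in the same shape.
print(requote("Poésies_(Mallarmé,_1914,_8e_éd.)"))
print(requote("Po%C3%A9sies_%28Mallarm%C3%A9%2C_1914%2C_8e_%C3%A9d.%29"))
```

Both calls should print Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.), but the safe set still has to list every character the target site leaves unencoded.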
In my opinion, you should unquote (and, if needed, quote again later) both values before comparing them.
import urllib.parse
text = 'Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.)/Salut'
text = urllib.parse.unquote(text)
print(text)
Result:
Poésies_(Mallarmé,_1914,_8e_éd.)/Salut
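The comparison itself could then look something like the sketch below. The function name same_title is hypothetical; it simply unquotes both sides, so it should not matter which characters each source chose to percent-encode.

```python
from urllib.parse import unquote

# Hypothetical helper: compares two titles after decoding both of them.
def same_title(a, b):
    return unquote(a) == unquote(b)

tag = "Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.)/Salut"

# Fully quoted input (parentheses and commas encoded as well).
print(same_title("Po%C3%A9sies_%28Mallarm%C3%A9%2C_1914%2C_8e_%C3%A9d.%29/Salut", tag))

# Decoded input, as it appears in the browser.
print(same_title("Poésies_(Mallarmé,_1914,_8e_éd.)/Salut", tag))
```

This avoids having to guess a safe set at all, at the cost of decoding every tag during the screening.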