Python url encoding all or just some characters

Time:02-15

I have a problem with URL encoding. Take, for instance, the following URL:

https://fr.wikisource.org/wiki/Poésies_(Mallarmé,_1914,_8e_éd.)

When I copy and paste this URL, I actually get

https://fr.wikisource.org/wiki/Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.)

And when I call urllib.parse.quote(), I get

https://fr.wikisource.org/wiki/Po%C3%A9sies_%28Mallarm%C3%A9%2C_1914%2C_8e_%C3%A9d.%29

So, as you can see, the last version also encodes the parentheses and possibly other symbols, while the previous one encodes only the accented, language-specific characters.

Now I am screening tags and need to match

Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.)/Salut

which uses the second type of encoding.

The program's input is of the form Poésies_(Mallarmé,_1914,_8e_éd.), which is how the URL looks in the browser's address bar.

How do I convert this to Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.), which is what I want to match?

Is there a way to check for all possible encodings of the input expression when doing the screening?

EDIT

I worked around it with

url_title = url_title.replace("%28","(").replace("%29",")").replace("%2C",",")

but of course that's not clean, as other input strings may contain other characters that get encoded differently.

CodePudding user response:

quote() and quote_plus() have a safe option that defines which characters should be left unescaped.

import urllib.parse

url = 'https://fr.wikisource.org/wiki/Poésies_(Mallarmé,_1914,_8e_éd.)'

print(urllib.parse.quote(url, safe=':/(),'))

Result:

https://fr.wikisource.org/wiki/Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.)

But it will still quote every other character, so you may need to add more characters to safe.
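To see what the safe argument changes, here is a small sketch that quotes only the page title (not the whole URL) twice. The name BROWSER_LIKE_SAFE and the exact set of characters in it are my own assumption for illustration, not something defined by urllib or by any browser.

```python
import urllib.parse

title = "Poésies_(Mallarmé,_1914,_8e_éd.)"

# Default behaviour: apart from letters, digits and "_.-~",
# only "/" is left unescaped.
strict = urllib.parse.quote(title)

# Hypothetical wider safe set (an assumption for this example):
# punctuation that browsers often leave as-is in a path.
BROWSER_LIKE_SAFE = "/:(),;@!$&'*+="
relaxed = urllib.parse.quote(title, safe=BROWSER_LIKE_SAFE)

print(strict)   # parentheses and commas are escaped
print(relaxed)  # only the accented characters are escaped
```

Whatever set you pick, it is still a guess about how the other side encoded its strings, which is why comparing decoded values is usually more robust.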


In my opinion, you should unquote both values (and re-quote them later if needed) before comparing them.

import urllib.parse

text = 'Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.)/Salut'
text = urllib.parse.unquote(text)

print(text)

Result:

Poésies_(Mallarmé,_1914,_8e_éd.)/Salut
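
The same idea can be wrapped in a small helper so that the screening does not care which encoding a string arrived in. This is only a minimal sketch: normalize_title and same_title are hypothetical names I made up, and it assumes the titles never contain a literal percent sign that should be kept.

```python
import urllib.parse

def normalize_title(title: str) -> str:
    """Decode percent-escapes so differently encoded titles compare equal."""
    return urllib.parse.unquote(title)

def same_title(a: str, b: str) -> bool:
    """Compare two titles after decoding both of them."""
    return normalize_title(a) == normalize_title(b)

# The three spellings of the same page title from the question.
variants = [
    "Poésies_(Mallarmé,_1914,_8e_éd.)",
    "Po%C3%A9sies_(Mallarm%C3%A9,_1914,_8e_%C3%A9d.)",
    "Po%C3%A9sies_%28Mallarm%C3%A9%2C_1914%2C_8e_%C3%A9d.%29",
]

# Every variant should match the first one once decoded.
print(all(same_title(variants[0], v) for v in variants))
```

This way you decode once on each side and never have to enumerate the possible encodings.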