I am trying to use urlparse Python library to parse some custom URIs.
I noticed that for some well-known schemes params
are parsed correctly:
>>> from urllib.parse import urlparse
>>> urlparse("http://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='http', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
>>> urlparse("ftp://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='ftp', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
...but for custom ones - they are not. params
field remains empty. Instead, params are treated as a part of path
:
>>> urlparse("scheme://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='scheme', netloc='some.domain', path='/some/nested/endpoint;param1=value1;param2=othervalue2', params='', query='query1=val1&query2=val2', fragment='fragment')
Why there is a difference in parsing depending on schema? How can I parse params within urlparse
library using custom schema?
CodePudding user response:
Can you remove that custom schemes from the url? That allways will return the params
urlparse("//some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
CodePudding user response:
This is because urlparse
assumes that only a set of schemes will uses parameters in their URL format. You can see that check with in the source code.
if scheme in uses_params and ';' in url:
url, params = _splitparams(url)
else:
params = ''
Which means urlparse
will attempt to parse parameters only if the scheme is in uses_params
(which is a list of known schemes).
uses_params = ['', 'ftp', 'hdl', 'prospero', 'http', 'imap',
'https', 'shttp', 'rtsp', 'rtspu', 'sip', 'sips',
'mms', 'sftp', 'tel']
So to get the expected output you can append your custom scheme into uses_params
list and perform the urlparse
call again.
>>> from urllib.parse import uses_params, urlparse
>>>
>>> uses_params.append('scheme')
>>> urlparse("scheme://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='scheme', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')