With Python 3.8, I want to join two parts of a URL into one. Here is an example:
domain = "https://some.domain.ch/myportal#/"
urllib.parse.urljoin(domain, "test1")
This gives the output
'https://some.domain.ch/test1'
but I expect the output
'https://some.domain.ch/myportal#/test1'
Asking just to understand.
As a workaround I will use
domain + "test1"
CodePudding user response:
>>> urllib.parse.urlparse(domain)
ParseResult(scheme='https', netloc='some.domain.ch', path='/myportal', params='', query='', fragment='/')
The problem is that you have a # in your path, which is not allowed by RFC 3986, the specification that urllib.parse follows. Everything after the # is parsed as the fragment, as the ParseResult above shows.
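To make the behaviour concrete, here is a minimal sketch using only the standard library. It just replays the URL from the question and shows where the "/" after the # ends up:

```python
from urllib.parse import urlsplit, urljoin

base = "https://some.domain.ch/myportal#/"

# Everything after the first "#" is parsed as the fragment, not the path.
parts = urlsplit(base)
print(parts.path)      # /myportal
print(parts.fragment)  # /

# urljoin resolves "test1" against the base path only: the last path
# segment ("myportal") is replaced and the base fragment is discarded.
joined = urljoin(base, "test1")
print(joined)  # https://some.domain.ch/test1
```

So the result in the question is what the relative-reference rules produce once the # is read as the start of the fragment.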
See §3 of the RFC for a diagram of the parts of a URL:
         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
The path is defined in §3.3. Yours is /myportal, which falls under the rules
path-absolute = "/" [ segment-nz *( "/" segment ) ]
...
segment-nz = 1*pchar
where pchar is defined in Appendix A:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
...
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
...
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
           / "*" / "+" / "," / ";" / "="
The # cannot be a pchar, so the path stops there.
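As a quick sanity check, the grammar can be turned into a small membership test. This is only a sketch: the character sets are transcribed from RFC 3986 Appendix A, the function name is_pchar_literal is my own, and it deliberately ignores pct-encoded, which is a three-character sequence rather than a single character.

```python
import string

# Character classes transcribed from the RFC 3986 ABNF (Appendix A).
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR_LITERALS = UNRESERVED | SUB_DELIMS | {":", "@"}


def is_pchar_literal(ch: str) -> bool:
    """Hypothetical helper: True if `ch` may appear unencoded in a path segment."""
    return ch in PCHAR_LITERALS


print(is_pchar_literal("#"))  # False
print(is_pchar_literal("+"))  # True
```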
Either remove the # if it is not required:
>>> import urllib.parse
>>> urllib.parse.urljoin("https://some.domain.ch/myportal/", "test1")
'https://some.domain.ch/myportal/test1'
Or percent-encode it:
>>> urllib.parse.quote("#")
'%23'
>>> urllib.parse.urljoin("https://some.domain.ch/myportal%23/", "test1")
#                                                        ^^^
'https://some.domain.ch/myportal%23/test1'
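If the segment comes from a variable, the encoding step can be wrapped in a small helper. A rough sketch, assuming the # is really meant to be a literal character of the path segment; join_with_literal_segment and its parameter names are made up for illustration and are not part of urllib:

```python
from urllib.parse import quote, urljoin


def join_with_literal_segment(origin: str, segment: str, relative: str) -> str:
    """Hypothetical helper: treat `segment` as one literal path segment."""
    # safe="" makes quote() also encode "/" along with "#" and other
    # reserved characters, so the whole string stays a single segment.
    base = f"{origin}/{quote(segment, safe='')}/"
    # The trailing "/" on the base keeps the encoded segment when joining.
    return urljoin(base, relative)


print(join_with_literal_segment("https://some.domain.ch", "myportal#", "test1"))
# https://some.domain.ch/myportal%23/test1
```

Note that a server receiving this URL sees "myportal#" as part of the path, which is not the same thing as a URL whose fragment is "/test1".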