Why does Python's urljoin not work as expected?

Time:02-17

With Python 3.8, I want to join two parts of a URL into one. Here is an example:

import urllib.parse

domain = "https://some.domain.ch/myportal#/"
urllib.parse.urljoin(domain, "test1")

This gives the output

'https://some.domain.ch/test1'

but I expect the output

'https://some.domain.ch/myportal#/test1'

Asking just to understand.

As a workaround I will use

domain + "test1"

Answer:

>>> urllib.parse.urlparse(domain)
ParseResult(scheme='https', netloc='some.domain.ch', path='/myportal', params='', query='', fragment='/')

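The problem is that you have an unencoded # in what you intend to be the path. Per RFC 3986, which urllib.parse follows, a # ends the path (and query) and starts the fragment, so the parser sees the path /myportal and the fragment /. urljoin resolves "test1" against the path only: it replaces the last path segment, myportal, and the base fragment is discarded.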
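A minimal sketch of that behaviour, using only the standard-library functions urlsplit, urldefrag and urljoin. The variable names are illustrative and not part of the question:

```python
from urllib.parse import urldefrag, urljoin, urlsplit

base = "https://some.domain.ch/myportal#/"

# Everything after the first "#" is the fragment, not part of the path.
parts = urlsplit(base)
print(parts.path)      # /myportal
print(parts.fragment)  # /

# urldefrag separates the URL proper from its fragment.
base_without_fragment, fragment = urldefrag(base)
print(base_without_fragment)  # https://some.domain.ch/myportal

# The base path has no trailing slash, so its last segment ("myportal")
# is replaced by the reference, and the base fragment is not carried over.
joined = urljoin(base, "test1")
print(joined)  # https://some.domain.ch/test1
```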

See §3 for a diagram of the parts of a URL:

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment

The path is defined in §3.3. Yours is /myportal, which is governed by the rules

path-absolute = "/" [ segment-nz *( "/" segment ) ]
...
segment-nz    = 1*pchar

where pchar is defined in Appendix A:

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
...
   pct-encoded   = "%" HEXDIG HEXDIG

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
...
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

A # cannot be a pchar, so the path ends at /myportal and everything after the # is parsed as the fragment.
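The grammar above can be turned into a small check. This is only a sketch: is_valid_segment is a hypothetical helper, not part of urllib, and it assumes plain ASCII input.

```python
import re
import string

# Character classes from RFC 3986, Appendix A.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR_LITERALS = UNRESERVED | SUB_DELIMS | {":", "@"}
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def is_valid_segment(segment: str) -> bool:
    """Hypothetical helper: True if `segment` is made only of pchar productions."""
    # Drop well-formed percent-encoded triplets; any "%" left over is invalid.
    remainder = PCT_ENCODED.sub("", segment)
    return all(ch in PCHAR_LITERALS for ch in remainder)


print(is_valid_segment("myportal"))     # True
print(is_valid_segment("myportal#"))    # False: "#" is not a pchar
print(is_valid_segment("myportal%23"))  # True: "#" is percent-encoded
```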

Either remove the # if it is not required:

>>> import urllib.parse
>>> urllib.parse.urljoin("https://some.domain.ch/myportal/", "test1")
'https://some.domain.ch/myportal/test1'

Or percent-encode it, so that it becomes part of the path instead of starting the fragment:

>>> urllib.parse.quote("#")
'%23'
>>> urllib.parse.urljoin("https://some.domain.ch/myportal%23/", "test1")
#                                                        ^^^
'https://some.domain.ch/myportal%23/test1'
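If the # is intentional (for example, a single-page application that routes on the fragment, which is an assumption about the portal), neither option yields the URL the question expects. One possible sketch is to resolve the reference inside the fragment. join_in_fragment is a hypothetical helper, not a urllib function:

```python
from urllib.parse import urldefrag, urljoin


def join_in_fragment(base: str, reference: str) -> str:
    """Hypothetical helper: resolve `reference` against the fragment of `base`."""
    url, fragment = urldefrag(base)
    # Treat the fragment as a path of its own and reuse urljoin's resolution rules.
    new_fragment = urljoin(fragment, reference)
    return f"{url}#{new_fragment}"


result = join_in_fragment("https://some.domain.ch/myportal#/", "test1")
print(result)  # https://some.domain.ch/myportal#/test1
```

This keeps the scheme, host and path untouched and only rewrites the part after the #, which is the result the plain string concatenation in the question produces for this input.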