What parts of a URL can be URL-encoded?

Time:05-26

My Chrome version 101 allows me to open

  • the percent-encoded form of https://example.com (everything is encoded except for the https://)

but not

  • the percent-encoded form of https://example.com/test (everything after the https:// is encoded, including the path delimiter /).

Exactly what parts and what characters of a URL can be URL-encoded, according to the latest specification?

By “parts,” I mean the scheme, username, password, host, port, path, query, fragment, ., :, //, @, ?, #, et cetera.

By “what characters,” I mean “characters of what value in what part.”

CodePudding user response:

By the specification

From RFC 3986.


2.1. Percent-Encoding

….

pct-encoded = "%" HEXDIG HEXDIG

The uppercase hexadecimal digits “A” through “F” are equivalent to the lowercase digits “a” through “f,” respectively. If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent. For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings.

  • The hexadecimal digits of a percent-encoding are case-insensitive (%2f and %2F are equivalent); uppercase is recommended.
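As a rough illustration of the case rule, here is a minimal Python sketch that uppercases the hex digits of every percent-encoded octet. The function name `uppercase_pct_encodings` is made up for this example; it is not part of any library.

```python
import re

# Matches one percent-encoded octet: "%" followed by two hex digits.
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")

def uppercase_pct_encodings(uri: str) -> str:
    """Uppercase the hex digits of each percent-encoded octet (hypothetical helper)."""
    return _PCT.sub(lambda m: m.group(0).upper(), uri)

# %2f and %2F denote the same octet, so both spellings normalize to one form.
print(uppercase_pct_encodings("https://example.com/a%2fb"))  # https://example.com/a%2Fb
```

Everything outside the percent-encoded octets is left untouched, so the sketch only changes the spelling of the encodings, not what they encode.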

2.2. Reserved Characters

reserved   = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.

A subset of the reserved characters (gen-delims) is used as delimiters of the generic URI components described in Section 3. A component’s ABNF syntax rule will not use the reserved or gen-delims rule names directly; instead, each syntax rule lists the characters allowed within that component (i.e., not delimiting it), and any of those characters that are also in the reserved set are “reserved” for use as subcomponent delimiters within the component. Only the most common subcomponents are defined by this specification; other subcomponents may be defined by a URI scheme’s specification, or by the implementation-specific syntax of a URI’s dereferencing algorithm, provided that such subcomponents are delimited by characters in the reserved set allowed within that component.

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character’s encoding in US-ASCII.

  • The characters “:/?#[]@!$&'()*+,;=” are reserved characters.
  • Each URL scheme’s specification picks its syntactic URL delimiters from the reserved characters.
  • A character acting as a syntactic URL delimiter must not be percent-encoded; encoding it turns it into data and changes how the URL is interpreted.
  • Reserved characters that are not acting as delimiters can be either percent-encoded or not, but the two forms are not equivalent, and producers should percent-encode them unless the scheme explicitly allows them as data in that component.
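To make the delimiter-versus-data distinction concrete, here is a small sketch using Python’s standard `urllib.parse` (my choice for illustration; the RFC does not prescribe any particular library). The helper `path_segments` is a hypothetical name. It splits the path on literal “/” first and only then decodes each segment, so an encoded %2F stays inside its segment as data.

```python
from typing import List
from urllib.parse import urlsplit, unquote

def path_segments(url: str) -> List[str]:
    """Split the path on literal "/" delimiters, then decode each segment."""
    raw_path = urlsplit(url).path           # still percent-encoded at this point
    return [unquote(seg) for seg in raw_path.split("/")[1:]]

print(path_segments("https://example.com/a/b"))    # ['a', 'b']
print(path_segments("https://example.com/a%2Fb"))  # ['a/b']
```

The two URLs differ only in whether “/” is percent-encoded, yet they yield different segment lists, which is why the RFC says such URIs are not equivalent.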

2.3. Unreserved Characters

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource. However, URI comparison implementations do not always perform normalization prior to comparison (see Section 6). For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.

6. Normalization and Comparison

…URI comparison is performed for some particular purpose. Protocols or implementations that compare URIs for different purposes will often be subject to differing design trade-offs in regards to how much effort should be spent in reducing aliased identifiers. This section describes various methods that may be used to compare URIs, the trade-offs between them, and the types of applications that might use them.

  • Characters that are allowed in a URL and are not reserved, that is, “ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~”, are unreserved characters.
  • The unreserved characters can be either percent-encoded or not, and both forms are equivalent, but leaving them unencoded is recommended.
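The normalization that Section 2.3 recommends can be sketched as follows. This is only an illustration under my own assumptions: `decode_unreserved` is a hypothetical helper, and a full normalizer would do more (case-folding the scheme and host, removing dot segments, and so on).

```python
import re
import string

# The unreserved set from RFC 3986, Section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def decode_unreserved(uri: str) -> str:
    """Decode percent-encoded unreserved characters; keep every other octet encoded."""
    def repl(m):
        ch = chr(int(m.group(1), 16))
        # Reserved and non-ASCII octets stay encoded (hex digits uppercased).
        return ch if ch in UNRESERVED else m.group(0).upper()
    return _PCT.sub(repl, uri)

print(decode_unreserved("https://%65%78ample.com/%7Euser/a%2fb"))
# https://example.com/~user/a%2Fb
```

Note that %2f is left encoded: it corresponds to the reserved character “/”, and decoding it would change how the URL is interpreted.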

Summary

  • Syntactic URL delimiters → cannot be percent-encoded (once encoded, they no longer act as delimiters).
  • All other characters → can be either percent-encoded or not.
  • The hexadecimal digits of a percent-encoding are case-insensitive.
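The first point is what bites in the question’s second URL. As a hedged illustration, here is how Python’s `urllib.parse.urlsplit` (one parser among many, and not the one Chrome uses) sees a URL with and without the path delimiter encoded:

```python
from urllib.parse import urlsplit

plain = urlsplit("https://example.com/test")
encoded = urlsplit("https://example.com%2Ftest")

# With a literal "/", the authority ends at the delimiter and a path follows.
print(repr(plain.netloc), repr(plain.path))      # 'example.com' '/test'

# With "%2F", no path delimiter exists, so everything is read as the authority.
print(repr(encoded.netloc), repr(encoded.path))  # 'example.com%2Ftest' ''
```

In the second case the parser is asked for a host literally named “example.com%2Ftest”, which is a different URL from https://example.com/test rather than another spelling of it.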

How implementations behave

Some implementations don’t perform complete URL normalization. For example, a percent-encoded spelling of “https://example.com” is a valid URL by the specification, but Chrome (version 101) does not normalize it into “https://example.com” when it’s entered into the omnibox.
