How to choose URL encoding standard in Go?

Time:02-15

I have a Go client that is communicating with a server that follows RFC 1738 URL encoding rules. RFC 1738 has since been updated (replaced) by RFC 3986, which is what Go seems to be using, at least in v1.17.7.

s := "blue ~light blue"
s = url.QueryEscape(s)
fmt.Println(s) // blue+~light+blue

In RFC 1738, ~ is a reserved ("unsafe") character and should be encoded as %7E, whereas in RFC 3986 it's not necessary to encode ~. This is just one difference between the two RFCs; there are likely others that I've not looked into yet, which is why a naive approach of replacing ~ with %7E isn't the path I want to go down.

Can I make Go create an "RFC 1738 compatible" encoded URL? If not, are there third-party libraries that can do this, perhaps by accepting an RFC number parameter? time already does this:

t.Format(time.RFC822)
t.Format(time.RFC850)
t.Format(time.RFC1123)
t.Format(time.RFC3339)

CodePudding user response:

In RFC 1738, ~ is a reserved ("unsafe") character and should be encoded as %7E

~ is not reserved. It has no special meaning in a URI.

The reserved characters in 1738 are: ";" | "/" | "?" | ":" | "@" | "&" | "=". The reserved characters in 3986 are: ":" / "/" / "?" / "#" / "[" / "]" / "@" / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=". The 3986 reserved set contains all the characters of the 1738 reserved set. It is a superset.

"Unsafe" is a different category, and RFC 3986 got rid of it for good reason.

RFC 1738 makes characters "unsafe" because they may have special meaning to other encodings.

  • The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs.
  • The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text.
  • The quote mark (""") is used to delimit URLs in some systems.
  • The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it.
  • The character "%" is unsafe because it is used for encodings of other characters.
  • Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".

That might have made sense in 1994, when things were much more lax and URLs were expected to be embedded freely in text, but here in 2022 the concern that "gateways and other transport agents are known to sometimes modify such characters" is long obsolete.

Nowadays it's well established that it's the responsibility of the thing using the text to do its own escaping, so RFC 3986 got rid of unsafe characters. It's not the RFC's job to guess what other encodings might use as special characters. The thing consuming your URI has the responsibility to escape and encode it according to its rules. If it doesn't, that's a bug and possibly a security problem for it.


Since ~ is not reserved, even pre-3986 code (RFC 3986 was published 17 years ago) will read both %7E and ~ in a URL as ~.

If ~ has special meaning to it and it doesn't do its own escaping it's likely very broken and insecure in many other ways. It will probably also choke on UTF-8.
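
To see the decoding point concretely, here is a small sketch that uses Go's standard url.QueryUnescape as a stand-in for a typical decoder. The helper name decodesSame is made up for this example; it only checks that the literal and escaped forms of ~ decode to the same string. It says nothing about how your particular server behaves.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodesSame reports whether two query-escaped strings decode to the
// same value. Hypothetical helper, used only for illustration.
func decodesSame(a, b string) (bool, error) {
	da, err := url.QueryUnescape(a)
	if err != nil {
		return false, err
	}
	db, err := url.QueryUnescape(b)
	if err != nil {
		return false, err
	}
	return da == db, nil
}

func main() {
	// One input leaves ~ as is, the other escapes it as %7E.
	same, err := decodesSame("blue+~light+blue", "blue+%7Elight+blue")
	fmt.Println(same, err) // true <nil>
}
```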

CodePudding user response:

Go does not provide any knobs for url.QueryEscape. It is easy enough to whip up a custom escaper for your scenario.

Start by declaring a table of the bytes that should be left as is in the result:

// noEscape[b] is true if b is in the intersection of the allowed
// bytes in RFC 1738 and HTML5 form values.  Note that RFC 1738 
// removes one byte allowed by HTML 5 -- '~'.
var noEscape = [256]bool{
    'A': true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
    'a': true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
    '0': true, true, true, true, true, true, true, true, true, true,
    '-': true,
    '_': true,
    '.': true,
}

Here's the function:

// queryEscape1738 escapes the string so it can be safely 
// placed inside a query.
func queryEscape1738(s string) string {
    percent := 0  // number of bytes to % encode
    plus := false // do we need to + encode space?
    for i := 0; i < len(s); i++ {
        b := s[i]
        if b == ' ' {
            plus = true
        } else if !noEscape[b] {
            percent++
        }
    }

    // Nothing to do?
    if percent == 0 && !plus {
        return s
    }

    // Encode!
    p := make([]byte, 0, len(s)+2*percent)
    for i := 0; i < len(s); i++ {
        b := s[i]
        if b == ' ' {
            p = append(p, '+')
        } else if noEscape[b] {
            p = append(p, b)
        } else {
            p = append(p, '%', "0123456789ABCDEF"[b>>4], "0123456789ABCDEF"[b&15])
        }
    }
    return string(p)
}
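
As a quick sanity check, the following self-contained sketch compares url.QueryEscape with a condensed stand-in for the function above. The names keep and escape1738 are made up for this example, and the stand-in trades the lookup table and two-pass sizing for a strings.Builder; it is meant to follow the same rules (space becomes +, letters, digits, '-', '_' and '.' pass through, everything else is %XX), not to replace the version above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// keep reports whether b is left unescaped: ASCII letters, digits,
// '-', '_' and '.', the same set as the noEscape table.
func keep(b byte) bool {
	switch {
	case 'A' <= b && b <= 'Z', 'a' <= b && b <= 'z', '0' <= b && b <= '9':
		return true
	}
	return b == '-' || b == '_' || b == '.'
}

// escape1738 is a condensed, hypothetical stand-in for queryEscape1738.
func escape1738(s string) string {
	const hex = "0123456789ABCDEF"
	var sb strings.Builder
	for i := 0; i < len(s); i++ {
		b := s[i]
		switch {
		case b == ' ':
			sb.WriteByte('+')
		case keep(b):
			sb.WriteByte(b)
		default:
			sb.WriteByte('%')
			sb.WriteByte(hex[b>>4])
			sb.WriteByte(hex[b&15])
		}
	}
	return sb.String()
}

func main() {
	in := "blue ~light blue"
	fmt.Println(url.QueryEscape(in)) // blue+~light+blue
	fmt.Println(escape1738(in))      // blue+%7Elight+blue
}
```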

All that said, it's unlikely that the server cares whether ~ is encoded or not. The typical decoder converts + to space and %xx to the decoded hex value, and uses all other byte values as is.
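
The three decoding rules just described can be sketched roughly as follows. decodeQuery is a hypothetical, minimal decoder written for this example; it is an assumption about what a "typical" decoder does, not the code of any real server. Because any byte that is not + or % is copied through unchanged, a literal ~ and %7E end up as the same byte.

```go
package main

import (
	"fmt"
	"strconv"
)

// decodeQuery is a hypothetical minimal decoder: '+' becomes a space,
// %XX becomes the byte with that hex value, and every other byte is
// copied through unchanged.
func decodeQuery(s string) (string, error) {
	out := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		switch s[i] {
		case '+':
			out = append(out, ' ')
		case '%':
			// A valid escape needs two hex digits after the '%'.
			if i+2 >= len(s) {
				return "", fmt.Errorf("truncated escape at offset %d", i)
			}
			v, err := strconv.ParseUint(s[i+1:i+3], 16, 8)
			if err != nil {
				return "", fmt.Errorf("bad escape at offset %d: %w", i, err)
			}
			out = append(out, byte(v))
			i += 2 // skip the two hex digits
		default:
			out = append(out, s[i])
		}
	}
	return string(out), nil
}

func main() {
	a, _ := decodeQuery("blue+%7Elight+blue")
	b, _ := decodeQuery("blue+~light+blue")
	fmt.Println(a == b, a) // true blue ~light blue
}
```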
