Golang html.Parse rewriting href query strings to contain &

Time:12-14

I have the following code:

package main

import (
    "os"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    myHtmlDocument := `<!DOCTYPE html>
<html>
<head>
</head>
<body>
    <a href="http://www.example.com/input?foo=bar&baz=quux">WTF</a>
</body>
</html>`

    doc, _ := html.Parse(strings.NewReader(myHtmlDocument))
    html.Render(os.Stdout, doc)
}

The html.Render function is producing the following output:

<!DOCTYPE html><html><head>

</head>
<body>
    <a href="http://www.example.com/input?foo=bar&amp;baz=quux">WTF</a>

</body></html>

Why is it rewriting the query string and converting & to &amp; (in-between bar and baz)?

Is there a way to avoid this behavior?

I'm trying to do template transformation, and I don't want it mangling my URLs.

CodePudding user response:

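The rewriting happens in html.Render rather than in html.Parse: the parser stores the attribute value with a plain &, and the renderer escapes it again so that the output is valid HTML. The W3C guideline quoted below states that an ampersand in an href attribute must be encoded as &amp;.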

https://www.w3.org/TR/xhtml1/guidelines.html#C_12

In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., &reg; for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&amp;"). For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.

In this case, Go is actually improving your HTML by making it valid.
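
The escaping is lossless, which a short Go sketch can illustrate. It uses the standard library's html.EscapeString and html.UnescapeString as stand-ins for the escaping that golang.org/x/net/html applies to attribute values when rendering (an assumption made to keep the example free of external dependencies); escapeAttr and unescapeAttr are hypothetical helper names, not part of either package.

```go
package main

import (
	"fmt"
	"html"
)

// escapeAttr is a hypothetical helper: it escapes a raw attribute value
// the way an HTML serializer would before writing it out.
func escapeAttr(v string) string {
	return html.EscapeString(v)
}

// unescapeAttr is a hypothetical helper: it decodes character references,
// as an HTML parser does when it reads an attribute value.
func unescapeAttr(v string) string {
	return html.UnescapeString(v)
}

func main() {
	raw := "http://www.example.com/input?foo=bar&baz=quux"

	escaped := escapeAttr(raw)
	fmt.Println(escaped) // http://www.example.com/input?foo=bar&amp;baz=quux

	// Decoding the escaped form gives back the original URL unchanged.
	fmt.Println(unescapeAttr(escaped) == raw) // true
}
```

If the goal is template transformation, it may be simpler to read and modify the attribute values on the parsed nodes (where the value holds a plain &) and let the renderer handle the escaping on output.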

That said, browsers unescape the attribute value, so the URL that is followed when the link is clicked is still the correct one (with a plain &, not &amp;):

console.log(document.querySelector('a').href)
 <a href="http://www.example.com/input?foo=bar&amp;baz=quux">WTF</a>
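
The same check can be sketched in Go without a browser. This is only an illustration using the standard library's html and net/url packages; queryFromHref is a hypothetical helper name, and it assumes the input is the href value exactly as it appears in the rendered markup.

```go
package main

import (
	"fmt"
	"html"
	"net/url"
)

// queryFromHref is a hypothetical helper: it decodes the character
// references in an href value taken from HTML source, then parses the
// resulting URL and returns its query parameters.
func queryFromHref(attr string) (url.Values, error) {
	u, err := url.Parse(html.UnescapeString(attr))
	if err != nil {
		return nil, err
	}
	return u.Query(), nil
}

func main() {
	q, err := queryFromHref("http://www.example.com/input?foo=bar&amp;baz=quux")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Both parameters are recovered, so the escaped href is not a mangled URL.
	fmt.Println(q.Get("foo"), q.Get("baz")) // bar quux
}
```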
