Why does len on x/net/html Token().Attr return a non-zero value for an empty slice here?

I am using the golang.org/x/net/html package in Go. Here's the code to reproduce the issue:

package main

import (
    "fmt"
    "log"
    "net/http"

    "golang.org/x/net/html"
)

const url = "https://google.com"

func main() {
    resp, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200 {
        log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
    }

    h := html.NewTokenizer(resp.Body)

    for {
        if h.Next() == html.ErrorToken {
            break
        }

        l := len(h.Token().Attr)

        if l != 0 {
            fmt.Println("=======")
            fmt.Println("Length", l) // greater than 0
            fmt.Println("Attr", h.Token().Attr) // always prints an empty slice
        }
    }
}

Here's what the output looks like:

=======
Length 2
Attr []
typeof Attr []html.Attribute
=======
Length 8
Attr []
typeof Attr []html.Attribute
=======
Length 1
Attr []
typeof Attr []html.Attribute
=======
Length 1
Attr []
typeof Attr []html.Attribute

go version

go version go1.17.7 linux/amd64

Why does Go think the length of h.Token().Attr is non-zero here when the h.Token().Attr is empty?

P.S.: If I save the result of h.Token().Attr in a variable and use that variable both for len and for printing, everything works as expected.

Code:

package main

import (
    "fmt"
    "log"
    "net/http"

    "golang.org/x/net/html"
)

const url = "https://google.com"

func main() {
    resp, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200 {
        log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
    }

    h := html.NewTokenizer(resp.Body)

    for {
        if h.Next() == html.ErrorToken {
            break
        }

        attrs := h.Token().Attr // save the output here and use it everywhere else
        l := len(attrs)

        if l != 0 {
            fmt.Println("=======")
            fmt.Println("Length", l)
            fmt.Println("Attr", attrs)
        }
    }
}

Output

Length 3
Attr [{ value AJiK0e8AAAAAYtZT7PXDBRBC2BJawIxezEfmIL6Aw5Uy} { name iflsig} { type hidden}]
=======
Length 4
Attr [{ class fl sblc} { align left} { nowrap } { width 25%}]
=======
Length 1
Attr [{ href /advanced_search?hl=en-IN&authuser=0}]
=======
Length 4
Attr [{ id gbv} { name gbv} { type hidden} { value 1}]

CodePudding user response：

Tokenizer has a kind of funny interface, and you aren't allowed to call Token() more than once between calls to Next(). As the doc says:

In EBNF notation, the valid call sequence per token is:
Next {Raw} [ Token | Text | TagName {TagAttr} ]

Which is to say: after calling Next() you may call Raw() zero or more times; then you can either:

Call Token() once,
Call Text() once,
Call TagName() once followed by TagAttr() zero or more times (presumably, either not at all because you don't care about the attributes, or enough times to retrieve all of the attributes).
Or do nothing (maybe you're skipping tokens).

The results of calling things out of sequence are undefined, because the methods modify internal state — they're not pure accessors. In your first snippet you call Token() multiple times between calls to Next(), so the result is invalid. All of the attributes are consumed by the first call, and aren't returned by the later ones.

CodePudding user response：

It's not empty, you just need to loop over it and view the values.

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    body := `
<html lang="en">
<body onload="fool()">
</body>
</html>
`
    h := html.NewTokenizer(strings.NewReader(body))

    for {
        if h.Next() == html.ErrorToken {
            break
        }

        attr := h.Token().Attr
        l := len(attr)

        if l != 0 {
            fmt.Println("=======")
            fmt.Println("Length", l) // greater than 0
            for i, a := range attr {
                fmt.Printf("Attr %d %v\n", i, a)
            }
        }
    }
}

Playground: https://go.dev/play/p/lzEdppsURl0

CodePudding user response：

(*Tokenizer).Token() returns a new Token every time it is called, and each Token gets a freshly built Attr slice. On a second call for the same token, the tokenizer's internal start and end positions are already equal (see line 1145 of the source), so the loop that collects attributes never runs and Attr comes back empty.