I am using the html package (golang.org/x/net/html) in Go.
Here's the code to reproduce the issue:
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/html"
)

const url = "https://google.com"

func main() {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
	}
	h := html.NewTokenizer(resp.Body)
	for {
		if h.Next() == html.ErrorToken {
			break
		}
		l := len(h.Token().Attr)
		if l != 0 {
			fmt.Println("=======")
			fmt.Println("Length", l)            // greater than 0
			fmt.Println("Attr", h.Token().Attr) // empty all the times
		}
	}
}
Here's what the output looks like
=======
Length 2
Attr []
typeof Attr []html.Attribute
=======
Length 8
Attr []
typeof Attr []html.Attribute
=======
Length 1
Attr []
typeof Attr []html.Attribute
=======
Length 1
Attr []
typeof Attr []html.Attribute
go version
go version go1.17.7 linux/amd64
Why does Go think the length of h.Token().Attr is non-zero here when h.Token().Attr is empty?
P.S.: Saving the result of h.Token().Attr in a variable, and using that variable both for len and for printing the contents, makes everything work fine.
Code:
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/html"
)

const url = "https://google.com"

func main() {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
	}
	h := html.NewTokenizer(resp.Body)
	for {
		if h.Next() == html.ErrorToken {
			break
		}
		attrs := h.Token().Attr // save the output here and use it everywhere else
		l := len(attrs)
		if l != 0 {
			fmt.Println("=======")
			fmt.Println("Length", l)
			fmt.Println("Attr", attrs)
		}
	}
}
Output
Length 3
Attr [{ value AJiK0e8AAAAAYtZT7PXDBRBC2BJawIxezEfmIL6Aw5Uy} { name iflsig} { type hidden}]
=======
Length 4
Attr [{ class fl sblc} { align left} { nowrap } { width 25%}]
=======
Length 1
Attr [{ href /advanced_search?hl=en-IN&authuser=0}]
=======
Length 4
Attr [{ id gbv} { name gbv} { type hidden} { value 1}]
Answer:
Tokenizer has a kind of funny interface, and you aren't allowed to call Token() more than once between calls to Next(). As the doc says:
In EBNF notation, the valid call sequence per token is:
Next {Raw} [ Token | Text | TagName {TagAttr} ]
Which is to say: after calling Next() you may call Raw() zero or more times; then you can either:
- Call Token() once,
- Call Text() once,
- Call TagName() once followed by TagAttr() zero or more times (presumably, either not at all because you don't care about the attributes, or enough times to retrieve all of the attributes),
- Or do nothing (maybe you're skipping tokens).
The results of calling things out of sequence are undefined, because the methods modify internal state; they're not pure accessors. In your first snippet you call Token() multiple times between calls to Next(), so the result is invalid. All of the attributes are consumed by the first call, and aren't returned by the later ones.
Answer:
It's not empty; you just need to loop over it and view the values.
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	body := `
<html lang="en">
<body onload="fool()">
</body>
</html>
`
	h := html.NewTokenizer(strings.NewReader(body))
	for {
		if h.Next() == html.ErrorToken {
			break
		}
		attr := h.Token().Attr // call Token() once and keep the result
		l := len(attr)
		if l != 0 {
			fmt.Println("=======")
			fmt.Println("Length", l) // greater than 0
			for i, a := range attr {
				fmt.Printf("Attr %d %v\n", i, a)
			}
		}
	}
}
Playground: https://go.dev/play/p/lzEdppsURl0
Answer:
(*Tokenizer).Token() returns a new Token every time, with a newly built []Attribute. On a second call to Token() for the same token, the tokenizer's start and end are already the same number (see line 1145 of the source), so it doesn't enter the loop that collects the attributes, and Attr is empty that time.