Parsing HTML with go-colly and function returns an empty slice

Time:05-04

I'm parsing a website with the colly framework and something is going wrong. I have a very basic function, getWeeks(), that should collect and return some strings, yet I'm getting an empty slice instead.

func getWeeks(c *colly.Collector) []string {
    var wks []string
    c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
        weekName := div.DOM.Find("span").Text() // a string such as "Week 1", "Week 2", etc.
        wks = append(wks, weekName)             // weekName holds an actual, non-empty value
        // Printing wks here shows the slice being populated correctly on each call
    })
    return wks  // returns []
}

func main() {
    c := colly.NewCollector(
    )

    w := getWeeks(c)
    fmt.Println(w)  // []

    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64)")
    })

    c.Visit("target url")

}

Answer:

tl;dr: The slice header is updated inside the OnHTML callback, but the value you print in main is a copy of the old slice header. You should work with *[]string instead.


First of all, the callback you pass to c.OnHTML runs only after you call c.Visit, so printing w right after getWeeks would show an empty slice in any case.

However, it would still be an empty slice even if you printed it after c.Visit. Why?

A slice in Go is implemented as a small data structure called a slice header, which holds a pointer to the backing array, a length, and a capacity.

When you assign the return value of getWeeks, you're essentially copying the slice header, including its fields Data, Len and Cap. You can see this by printing the addresses of the two slice variables with the %p verb (using some other struct instead of go-colly to make the example self-contained):

func getWeeks(c *Foo) []string {
    var wks []string
    c.OnHTML("div.ltbluediv", func(text string) {
        weekName := text
        wks = append(wks, weekName)
    })
    fmt.Printf("%p\n", &wks)
    return wks
}

func main() {
    c := &Foo{}

    w := getWeeks(c)

    c.Visit("target url")
    fmt.Printf("%p\n", &w)

}

This prints two different memory addresses:

0xc0000ac030
0xc0000ac018
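
The Foo type used above is not shown in the answer. The following is a minimal sketch of what such a stand-in might look like, assuming it only needs to store callbacks and run them on Visit. The Foo implementation, the sample texts and the extra captured function returned from getWeeks are all assumptions added for illustration; they are not part of go-colly. The sketch makes both points observable at once: the callbacks run only during Visit, and w stays empty while the variable captured by the closure is populated.

```go
package main

import "fmt"

// Foo is a hypothetical stand-in for *colly.Collector: it stores callbacks
// and runs them only when Visit is called.
type Foo struct {
	callbacks []func(text string)
}

// OnHTML only registers the callback; the selector is ignored in this sketch.
func (f *Foo) OnHTML(selector string, cb func(text string)) {
	f.callbacks = append(f.callbacks, cb)
}

// Visit pretends the page contained three matching elements.
func (f *Foo) Visit(url string) {
	for _, text := range []string{"Week 1", "Week 2", "Week 3"} {
		for _, cb := range f.callbacks {
			cb(text)
		}
	}
}

// getWeeks returns a copy of the slice header taken before any callback ran.
// The second return value is an illustrative hook (not in the original code)
// that reads the variable the closure actually appends to.
func getWeeks(c *Foo) ([]string, func() []string) {
	var wks []string
	c.OnHTML("div.ltbluediv", func(text string) {
		wks = append(wks, text)
	})
	return wks, func() []string { return wks }
}

func main() {
	c := &Foo{}
	w, captured := getWeeks(c)
	c.Visit("target url")

	fmt.Println(len(w), w == nil) // 0 true
	fmt.Println(captured())       // [Week 1 Week 2 Week 3]
}
```

The closure keeps appending to its own wks variable, while w in main is a separate variable holding the header as it was at the time of the return.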

If you read further about slice and append behavior on Stack Overflow, you will find that append does not reallocate the backing array as long as the slice has sufficient capacity.

However, even if you make sure the backing array stays the same by initializing wks with sufficient capacity, the value of w is still a copy of the original slice header, and therefore still has length 0. Printing the slice header at each step shows this:

in getWeeks reflect.SliceHeader{Data:0xc0000121b0, Len:0, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:1, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:2, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:3, Cap:3}
[]
in main reflect.SliceHeader{Data:0xc0000121b0, Len:0, Cap:3}

You could adjust the length of w by reslicing it:

c.Visit("target url")
w = w[0:3]
fmt.Println(w) // [foo bar baz]

But this means that you need to know in advance both a capacity large enough to avoid reallocation and the final length to reslice to.
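
The capacity and reslicing behavior described above can be reproduced without go-colly. This is a minimal sketch: register is a hypothetical helper invented for the example, standing in for a function that hands out a slice value and then keeps appending to its own captured variable.

```go
package main

import "fmt"

// register is a hypothetical helper. It returns the slice value as it is
// before any append, plus a function that appends through the captured variable.
func register() ([]string, func(string)) {
	wks := make([]string, 0, 3) // capacity 3, so three appends do not reallocate
	add := func(s string) { wks = append(wks, s) }
	return wks, add
}

func main() {
	w, add := register()
	for _, s := range []string{"foo", "bar", "baz"} {
		add(s)
	}

	fmt.Println(len(w), cap(w)) // 0 3
	fmt.Println(w)              // []
	fmt.Println(w[:cap(w)])     // [foo bar baz]
}
```

The elements are visible after reslicing only because all three appends fit in the original backing array. A fourth append would allocate a new array, and w would no longer share storage with the captured slice.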

Instead, return a pointer to a slice:

func getWeeks(c *colly.Collector) *[]string {
    wks := &[]string{}
    c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
        weekName := div.DOM.Find("span").Text()
        *wks = append(*wks, weekName) 
    })
    return wks
}

Or pass a pointer into getWeeks:

func getWeeks(c *colly.Collector, wks *[]string) {
    c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
        weekName := div.DOM.Find("span").Text()
        *wks = append(*wks, weekName)
    })
}

Fixed playground: https://go.dev/play/p/yhq8YYnkFsv
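
For completeness, here is a sketch of how the pointer-returning version might be used from main. To keep it runnable without network access, it uses a hypothetical collector type in place of *colly.Collector; that type, its methods and the sample texts are assumptions for illustration only.

```go
package main

import "fmt"

// collector is a hypothetical stand-in for *colly.Collector.
type collector struct {
	callbacks []func(text string)
}

// OnHTML registers a callback; the selector is ignored in this sketch.
func (c *collector) OnHTML(selector string, cb func(text string)) {
	c.callbacks = append(c.callbacks, cb)
}

// Visit pretends the page contained three matching elements.
func (c *collector) Visit(url string) {
	for _, text := range []string{"Week 1", "Week 2", "Week 3"} {
		for _, cb := range c.callbacks {
			cb(text)
		}
	}
}

// getWeeks returns a pointer, so the caller and the callback share one slice header.
func getWeeks(c *collector) *[]string {
	wks := &[]string{}
	c.OnHTML("div.ltbluediv", func(text string) {
		*wks = append(*wks, text)
	})
	return wks
}

func main() {
	c := &collector{}
	w := getWeeks(c)      // register the callback first
	c.Visit("target url") // the callback runs here
	fmt.Println(*w)       // [Week 1 Week 2 Week 3]
}
```

The order matters: dereference w only after Visit has returned. With the real library, if the collector is created with colly.Async(true), you would presumably also need to call c.Wait() before reading the result.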
