Get unicode characters as string when reading response body (Golang)

Time:10-05

I'm scraping a website that was written in Polish, meaning it contains characters such as ź and ę.

When I attempt to parse the html, either using the html package or even by splitting the string of the response body, I get output like this:

���~♦�♀�����r�▬֭��↔��q���y���<p��19��lFۯ☻→Z�7��

I'm currently using

bodyBytes, err := ioutil.ReadAll(resp.Body)
if err != nil {
  //handle
}
bodyString := string(bodyBytes)

to get the string.

How can I get the text in readable format?

CodePudding user response:

Which website are you working on? I get the correct characters when I test against a Wikipedia page:

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    resp, err := http.Get("https://en.wikipedia.org/wiki/Polish_alphabet")
    if err != nil {
        // resp is nil on error, so stop before touching resp.Body
        panic(err)
    }
    defer resp.Body.Close()
    b, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    bodyStr := string(b)
    fmt.Println(bodyStr)
}

<td>Ą</td>
<td>Ć</td>
<td>Ę</td>

CodePudding user response:

Update:

The Content-Encoding of the response was gzip, so the raw body was compressed bytes rather than text. Decompressing it first, as in the code below, gave me the response as a printable string:

// resp.Body holds gzip-compressed bytes, so wrap it in a gzip reader first
gReader, err := gzip.NewReader(resp.Body)
if err != nil {
    return err
}
defer gReader.Close()
gBytes, err := ioutil.ReadAll(gReader)
if err != nil {
    return err
}
bodyStr := string(gBytes)