I'm scraping a website written in Polish, so it contains characters such as ź and ę.
When I try to parse the HTML, either with the html package or by splitting the response body string, I get output like this:
���~♦�♀�����r�▬֭��↔��q���y���<p��19��lFۯ☻→Z�7��
I'm currently using
bodyBytes, err := ioutil.ReadAll(resp.Body)
if err != nil {
//handle
}
bodyString := string(bodyBytes)
to get the string.
How can I get the text in readable format?
CodePudding user response:
Which website are you working with? I get the correct characters when I test against a Wikipedia page:
package main
import (
"fmt"
"io"
"net/http"
)
func main() {
resp, err := http.Get("https://en.wikipedia.org/wiki/Polish_alphabet")
if err != nil {
panic(err) // resp is nil on error, so do not fall through to resp.Body
}
defer resp.Body.Close()
b, err := io.ReadAll(resp.Body)
if err != nil {
panic(err)
}
bodyStr := string(b)
fmt.Println(bodyStr)
}
Part of the output:
<td>Ą</td>
<td>Ć</td>
<td>Ę</td>
CodePudding user response:
Update:
The response had Content-Encoding: gzip, so the body has to be decompressed before it is converted to a string. The code below produced readable text:
gReader, err := gzip.NewReader(resp.Body) // from compress/gzip
if err != nil {
return err
}
defer gReader.Close() // also runs if ReadAll fails
gBytes, err := ioutil.ReadAll(gReader)
if err != nil {
return err
}
bodyStr := string(gBytes)
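
For anyone hitting the same thing, here is a minimal, self-contained sketch of the idea. Go's net/http normally decompresses gzip for you, but it leaves the body compressed when the request sets Accept-Encoding itself. That may be what happened here, though it is only a guess since the question doesn't show the request. The helper names readBody and fetchGzipped are made up for this example, and the local httptest server only stands in for the real site.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// readBody (hypothetical helper) returns the response body as a string and
// gunzips it first when the response is labelled Content-Encoding: gzip.
func readBody(resp *http.Response) (string, error) {
	var r io.Reader = resp.Body
	if resp.Header.Get("Content-Encoding") == "gzip" {
		gz, err := gzip.NewReader(resp.Body)
		if err != nil {
			return "", err
		}
		defer gz.Close()
		r = gz
	}
	b, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// fetchGzipped (hypothetical helper) starts a local test server that always
// answers with gzipped text, requests it with an explicit Accept-Encoding
// header, and returns the decoded body.
func fetchGzipped(text string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Encoding", "gzip")
		gz := gzip.NewWriter(w)
		gz.Write([]byte(text))
		gz.Close()
	}))
	defer srv.Close()

	req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
	if err != nil {
		return "", err
	}
	// Setting this header by hand turns off the transport's automatic
	// decompression, so the body arrives as raw gzip bytes.
	req.Header.Set("Accept-Encoding", "gzip")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return readBody(resp)
}

func main() {
	s, err := fetchGzipped("zażółć gęślą jaźń")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Checking the header first means the same helper also works for responses that were never compressed, or that the transport already decompressed (in that case it removes the Content-Encoding header).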