I'm using Spring WebClient to fetch HTML. The response contains Polish characters such as ą, ę, ż and so on.
After calling the service I expect the response to look like this: <div>plan zajęć</div>
But the actual response looks like this: <div>plan zaj�ć</div>
The � sign replaces all Polish characters.
Here's a WebClient bean config:
@Bean
WebClient webClient() {
    return WebClient.builder()
            .build();
}
And here's how I use it:
Optional<String> resp = webClient.get()
        .uri(uri)
        .retrieve()
        .bodyToMono(String.class)
        .blockOptional();
And here's a link to the page that I'm trying to scrape: https://plan.polsl.pl/plan.php?winW=1000&winH=1000&type=0&id=343126158
I've no idea what to change in the WebClient configuration to get the desired effect, so I'm asking for help.
CodePudding user response:
Please show how you use WebClient. I don't know Polish characters, but your problem is very likely related to the encoding of the response.
You can try setting the accepted charset to UTF-8 and see if that helps:
WebClient webClient = WebClient.create();
Mono<String> response = webClient.get()
        .uri(uri)
        .acceptCharset(StandardCharsets.UTF_8)
        .retrieve()
        .bodyToMono(String.class);
String responseString = response.block();
== Updated 1/2/2023 ==
Note that a Java String holds already-decoded Unicode text (UTF-16 internally), and WebClient decodes a String body as UTF-8 when the response does not declare a charset. That's why we asked the web server to return the document in UTF-8. Unfortunately, the web server that you specified returns the document in ISO-8859-2 even though WebClient requests UTF-8. You will have to decode the response bytes as ISO-8859-2 yourself. Here is sample code to do that. I tested it with your web server.
WebClient webClient = WebClient.create();
Mono<ByteArrayResource> responseBody = webClient.get()
        .uri(uri)
        .retrieve()
        .bodyToMono(ByteArrayResource.class);
// decode the raw bytes with the charset the server actually used
String responseString = new String(responseBody.block().getByteArray(), Charset.forName("ISO-8859-2"));
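To see why the � shows up in the first place, here is a minimal sketch without Spring. It assumes the body is the ISO-8859-2 encoding of the snippet from the question; the class name DecodingDemo and its helper are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Spring or of the original code.
final class DecodingDemo {
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    // Decodes raw response bytes with the given charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // "<div>plan zajęć</div>", written with Unicode escapes so the
        // example does not depend on the encoding of the source file
        String original = "<div>plan zaj\u0119\u0107</div>";

        // what a server using ISO-8859-2 would put on the wire
        byte[] body = original.getBytes(LATIN2);

        // wrong charset: the Polish letters are not valid UTF-8 sequences,
        // so the decoder substitutes U+FFFD (the replacement character)
        System.out.println(decode(body, StandardCharsets.UTF_8));

        // right charset: the text round-trips unchanged
        System.out.println(decode(body, LATIN2));
    }
}
```

The bytes are the same in both calls; only the charset used for decoding differs, which is why the fix has to happen on the raw bytes and not on an already-decoded String.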
If you are building a generic web crawler, instead of hardcoding ISO-8859-2 as in the code above, you will need to get the charset from the Content-Type header. Most web servers report the media type as well as the charset in Content-Type. You can then pass that charset to the String constructor instead of the hardcoded one. Here is sample code to find the charset.
WebClient webClient = WebClient.create();
// exchange() is deprecated since Spring 5.3; exchangeToMono is its replacement
Mono<String> charsetMono = webClient
        .get()
        .uri("http://example.com")
        .exchangeToMono(res -> {
            // an absent Content-Type header becomes an empty string
            String contentType = res.headers().contentType()
                    .map(MediaType::toString)
                    .orElse("");
            // parse the Content-Type header to extract the charset
            Matcher m = Pattern.compile("charset=([^;]+)").matcher(contentType);
            // the Mono completes empty when no charset is declared
            return m.find() ? Mono.just(m.group(1)) : Mono.empty();
        });
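If you only have the raw header value, the parsing step can be sketched in plain Java as below. The class ContentTypeCharset and its parse method are hypothetical names for illustration. It also handles a quoted charset value and an unknown charset name, both of which I'm assuming a crawler will run into sooner or later. With Spring on the classpath, MediaType.getCharset() should give you the same information without a regex.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spring.
final class ContentTypeCharset {
    // matches charset=utf-8 and charset="utf-8", in any letter case
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type header value, or empty
    // when the header is missing, has no charset, or names an unknown one.
    static Optional<Charset> parse(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // illegal or unsupported charset name
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html; charset=ISO-8859-2"));
        System.out.println(parse("text/html"));
    }
}
```

Returning an Optional keeps the "no charset declared" case explicit, so the caller can decide on a fallback instead of getting a null.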
Unfortunately, the web server that you specified doesn't report the charset in the Content-Type header either. In this case, you need to look elsewhere in the response to determine the character encoding.
One place you can check is the charset attribute of the <meta> element in the HTML document. Many pages include a <meta> element with a charset attribute that specifies the character encoding of the document. This is how I found out that your document uses the ISO-8859-2 charset.
WebClient doesn't have an easy way to extract the charset from the <meta> tag, but you can use a regular expression to extract it. Here is sample code:
WebClient webClient = WebClient.create();
Mono<String> responseBody = webClient
        .get()
        .uri("http://example.com")
        .retrieve()
        .bodyToMono(String.class);
Mono<String> charsetMono = responseBody.map(html -> {
    // use a regular expression to extract the charset attribute from the <meta> element
    Matcher m = Pattern.compile("<meta[^>]+charset=[\"']?([^\"'>]+)[\"']?").matcher(html);
    // fall back to an empty string when the page declares no charset
    return m.find() ? m.group(1) : "";
});
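Putting the pieces together, one way to go from raw bytes to a correctly decoded String is sketched below in plain Java. MetaCharsetSniffer is a made-up name, and the UTF-8 fallback and the 1024-byte scan window are my assumptions (the HTML spec asks authors to put the encoding declaration within the first 1024 bytes). The idea is to read the start of the body as ISO-8859-1, which maps every byte to a character so the ASCII markup survives, find the declared charset, and then decode the whole body with it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spring.
final class MetaCharsetSniffer {
    // covers <meta charset="x"> and <meta http-equiv=... content="...; charset=x">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]+charset\\s*=\\s*[\"']?([^\"'>\\s;/]+)",
            Pattern.CASE_INSENSITIVE);

    // Looks for a charset declaration in the first 1024 bytes of the body.
    static Charset sniff(byte[] body, Charset fallback) {
        int prefixLength = Math.min(body.length, 1024);
        // ISO-8859-1 never fails to decode, so the ASCII tags stay readable
        String head = new String(body, 0, prefixLength, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // illegal or unsupported name: use the fallback below
            }
        }
        return fallback;
    }

    // Decodes the body with the sniffed charset, assuming UTF-8 otherwise.
    static String decode(byte[] body) {
        return new String(body, sniff(body, StandardCharsets.UTF_8));
    }
}
```

With WebClient you would feed it the bytes from bodyToMono(ByteArrayResource.class), for example MetaCharsetSniffer.decode(responseBody.block().getByteArray()). A real crawler would probably check the Content-Type header first and only sniff the markup when the header has no charset.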