My program pulls URLs with Jsoup and then sends them to a checker function that compares them against a database. All URLs need to be decoded before they are compared. This works well, except for one part of some URLs that it cannot decode:
Detail/driller%27s-
The %27 is a single quote.
When I decode it as a string literal, it works:
System.out.println(java.net.URLDecoder.decode("Detail/driller%27s-", StandardCharsets.UTF_8.name()));
But when I pass the URL to the function below, it prints %27 and does not change it to a quote:
public static boolean checkURL(String url) {
    System.out.println(java.net.URLDecoder.decode(url, StandardCharsets.UTF_8.name()));
}
I have tried:
Assigning the String to a new variable and then decoding it, i.e. String url2 = url.
Converting the String to a URL and then decoding it.
Using url.toString().
Encoding the String and then decoding it, but the escape remains regardless.
Checking for invisible spaces with url.replaceAll(" ", "").
I could just use .replace("%27", "'"), but I am concerned that if I do this there might be other values it is not decoding, causing other issues down the track. If I can determine the cause, I can stop it from occurring again.
Here is the code:
CodePudding user response:
It is a common misconception that URLDecoder is for decoding URLs. Despite the name, that is not the purpose of the class. It is actually for decoding application/x-www-form-urlencoded request bodies, which are typically the result of a web page user submitting an HTML form.
The correct way to remove percent-escapes in a URL is using the URI class:
URI uri = URI.create("Detail/driller%27s-");
System.out.println(uri.getSchemeSpecificPart());
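To make the difference between the two classes concrete, here is a minimal sketch that decodes the same input both ways. The class name, the helper names, and the sample string (which adds a `+` to the URL from the question) are hypothetical and only there for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

final class UriDecodeSketch {

    // Decodes percent-escapes using URI rules; a '+' stays a '+'.
    static String decodeWithUri(String raw) {
        return URI.create(raw).getSchemeSpecificPart();
    }

    // Decodes using HTML form rules; a '+' becomes a space.
    static String decodeAsForm(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "Detail/driller%27s+guide"; // hypothetical sample input
        System.out.println(decodeWithUri(raw)); // Detail/driller's+guide
        System.out.println(decodeAsForm(raw));  // Detail/driller's guide
    }
}
```

Both calls turn %27 into a single quote, but only URLDecoder rewrites the plus sign, which matters if the database stores paths that contain a literal `+`.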
CodePudding user response:
I resolved the issue by decoding the URL twice. First with
url = java.net.URLDecoder.decode(url.toLowerCase(), StandardCharsets.UTF_8.name());
and then a second time when comparing it:
line.contains(java.net.URLDecoder.decode(url.toLowerCase(), StandardCharsets.UTF_8.name()))
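One explanation consistent with this fix is that the incoming URL was percent-encoded twice, so the quote arrived as %2527 and a single pass only reduced it to %27. That is an assumption, since the raw input is not shown here. Under that assumption, the sketch below decodes until the string stops changing instead of hard-coding two passes. The class name, `decodeFully`, and the `maxPasses` cap are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class RepeatedDecodeSketch {

    // One decoding pass, with the checked exception wrapped.
    static String decodeOnce(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Repeats decoding until the value is stable or maxPasses is reached.
    static String decodeFully(String s, int maxPasses) {
        String current = s;
        for (int i = 0; i < maxPasses; i++) {
            String next = decodeOnce(current);
            if (next.equals(current)) {
                return current; // nothing left to decode
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        String doubleEncoded = "Detail/driller%2527s-"; // assumed raw input
        System.out.println(decodeOnce(doubleEncoded));     // Detail/driller%27s-
        System.out.println(decodeFully(doubleEncoded, 5)); // Detail/driller's-
    }
}
```

Printing the raw `url` before any decoding would confirm whether %25 is actually present. Note that repeated decoding also rewrites a value that legitimately contains the text %27, and URLDecoder throws IllegalArgumentException on a malformed escape such as a trailing %.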