URLDecoder not decoding %27 ( ' )


My program pulls URLs with Jsoup and sends each one to a checker function that compares it against a database. All URLs need to be decoded before the comparison. This works well, except for one part of some URLs that it cannot decode: %27, which is a single quote.

When I decode this as a String like so it works:

System.out.println(java.net.URLDecoder.decode("Detail/driller%27s-", StandardCharsets.UTF_8.name()));

But when I pass the URL to the function below, the output still contains %27 instead of a quote:

public static boolean checkURL(String url) {
    System.out.println(java.net.URLDecoder.decode(url, StandardCharsets.UTF_8.name()));
    // ... comparison against the database not shown
}

I have tried:

Assigning the String to a variable and then trying to decode it, i.e. String url2 = url.

Converting the String to a URL then decoding it.

Using url.toString().

Encoding the String and then decoding it, but the %27 remains regardless.

Checking for invisible space with url.replaceAll(" ", "").

I could just use .replace("%27", "'"), but I am concerned that there might be other values it is not decoding, which would cause other issues down the track. If I can determine the cause, I can stop it from occurring again.


CodePudding user response:

It is a common misconception that URLDecoder is for decoding URLs. Despite the name, that is not the purpose of the class. It is actually for decoding application/x-www-form-urlencoded request bodies, which are typically the result of a user submitting an HTML form on a web page.
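As a small illustration of that difference, the sketch below shows URLDecoder applying form-encoding rules: besides resolving percent-escapes, it turns every '+' into a space, which is correct for form data but not for a URL path. The class name FormDecodingDemo, the helper formDecode and the sample strings are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class FormDecodingDemo {
    // Decodes with application/x-www-form-urlencoded rules:
    // percent-escapes are resolved and '+' is treated as a space.
    static String formDecode(String s) throws UnsupportedEncodingException {
        return URLDecoder.decode(s, StandardCharsets.UTF_8.name());
    }

    public static void main(String[] args) throws Exception {
        // Typical form data: '+' stands for a space, %26 for '&'.
        System.out.println(formDecode("q=rock+%26+roll"));  // q=rock & roll

        // A path segment containing a literal '+' loses it.
        System.out.println(formDecode("Detail/a+b%27s-"));  // Detail/a b's-
    }
}
```

If the URLs in the database can contain a literal '+', this is one way a URLDecoder-based comparison could go wrong silently.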

The correct way to remove percent-escapes in a URL is using the URI class:

URI uri = URI.create("Detail/driller%27s-");
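A minimal sketch of how the decoded value might then be read from the URI, assuming the path contains the percent-escape %27: getPath() returns the path with escapes decoded, while getRawPath() returns it as written. The class name UriPathDecoding, the helper decodedPath and the example.com URL are assumptions for illustration only.

```java
import java.net.URI;

public class UriPathDecoding {
    // Returns the path component with percent-escapes decoded.
    // URI.create throws IllegalArgumentException if the input is not
    // a syntactically valid URI (for example, if it contains a space).
    static String decodedPath(String rawUrl) {
        return URI.create(rawUrl).getPath();
    }

    public static void main(String[] args) {
        URI uri = URI.create("Detail/driller%27s-");
        System.out.println(uri.getRawPath()); // Detail/driller%27s-
        System.out.println(uri.getPath());    // Detail/driller's-

        // Unlike URLDecoder, a literal '+' in the path is left alone,
        // and the query string is not part of the path.
        System.out.println(decodedPath("https://example.com/Detail/a+b%27s-?x=1"));
        // /Detail/a+b's-
    }
}
```

Because URI.create rejects malformed input, URLs scraped from real pages may need a try/catch around the call.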

CodePudding user response:

I resolved the issue by decoding it twice, first with

url = java.net.URLDecoder.decode(url.toLowerCase(), StandardCharsets.UTF_8.name());

and then again when comparing it:

line.contains(java.net.URLDecoder.decode(url.toLowerCase(), StandardCharsets.UTF_8.name()))
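One plausible explanation for why a second pass helps, which the thread does not confirm, is that the scraped URL was percent-encoded twice, so the quote arrived as %2527 and a single decode only got as far as %27. The sketch below illustrates that assumption by decoding until the string stops changing. The class name RepeatedDecoding, the helper decodeFully and the pass limit are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class RepeatedDecoding {
    // Decodes repeatedly until the value is stable or maxPasses is reached.
    // Caution: this also changes strings where a literal "%27" was intended.
    static String decodeFully(String s, int maxPasses) throws UnsupportedEncodingException {
        String current = s;
        for (int i = 0; i < maxPasses; i++) {
            String next = URLDecoder.decode(current, StandardCharsets.UTF_8.name());
            if (next.equals(current)) {
                break; // nothing left to decode
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) throws Exception {
        String doubleEncoded = "Detail/driller%2527s-"; // %25 is an encoded '%'

        // One pass only removes the outer layer of encoding.
        System.out.println(URLDecoder.decode(doubleEncoded, StandardCharsets.UTF_8.name()));
        // Detail/driller%27s-

        System.out.println(decodeFully(doubleEncoded, 5));
        // Detail/driller's-
    }
}
```

Printing the raw string before any decoding would show whether %2527 is really present, which addresses the concern about other values failing in the same way.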
