How to handle U+FFFC (object replacement character) in a URL

Time:08-22

I am using Jsoup to scrape URLs, and one of the URLs I keep getting has the U+FFFC symbol in it. I have tried decoding the URL:

url = URLDecoder.decode(url, "UTF-8" );

but the symbol still remains in the decoded string.

I can't find much online about this, other than that it is "The object replacement character, sometimes used to represent an embedded object in a document when it is converted to plain text."

But if that is the case and it is plain text, I should be able to print the symbol. However, when I run

System.out.println("");

I get a compilation error, and the file reverts back to the last save.

Sample URL: https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles/

NOTE: If you decode the URL and then compare the result with the decoded URL, they come back as not the same, e.g.:

        String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles/", "UTF-8");
        if(url.contains("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles?/")){
            System.out.println("The same");
        }else {
            System.out.println("Not the same");
        }

CodePudding user response:

That's not a compilation error. It is the Eclipse code editor telling you that it can't save the source code to a file: you have told it to save the file in the cp1252 encoding, and that encoding can't represent U+FFFC.

Put differently, your development environment is currently configured to store source code in the cp1252 encoding, which doesn't support the character you want. You can either configure your development environment to store source code in a more flexible encoding (such as UTF-8, as the error message suggests), or avoid having that character in your source code, for instance by using its Unicode escape sequence instead:

System.out.println("\ufffc");
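The encoding limitation described above can be checked directly. The following is a minimal sketch, with illustrative class and method names, that prints the UTF-8 bytes of U+FFFC and asks an encoder whether a given charset can represent the character. It assumes the JDK in use provides the `windows-1252` charset (the standard name for cp1252).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    // Returns the UTF-8 bytes of the text as lowercase hex, separated by spaces.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // True if the named charset has a mapping for the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\uFFFC"));                   // three bytes in UTF-8
        System.out.println(canEncode("windows-1252", '\uFFFC')); // cp1252 has no mapping
        System.out.println(canEncode("UTF-8", '\uFFFC'));        // UTF-8 can encode it
    }
}
```

Because the source uses the escape sequence rather than the literal character, this file itself can be saved in cp1252 without triggering the editor's warning.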

Note that as far as the Java language and runtime are concerned, U+FFFC is a character like any other, so there may not be a particular need to "handle" it. Also, I am unsure why you'd expect URLDecoder to do anything if the URL hasn't been URL-encoded to begin with.
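As a rough illustration of both points, the sketch below treats U+FFFC as an ordinary character and shows what URLDecoder does with and without percent escapes. The class name, the helper methods, and the `example.com` URLs are hypothetical and only serve the example. Whether stripping the character is the right choice depends on what the target server expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class ObjectReplacementDemo {
    // U+FFFC written as an escape, so the source stays pure ASCII.
    static final String OBJ = "\uFFFC";

    // Hypothetical helper: true if the text contains U+FFFC.
    static boolean hasObjectReplacement(String text) {
        return text.contains(OBJ);
    }

    // Hypothetical helper: drops every U+FFFC from the text.
    static String stripObjectReplacement(String text) {
        return text.replace(OBJ, "");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // A string without percent escapes passes through URLDecoder unchanged.
        String literal = "https://example.com/roles" + OBJ + "/";
        System.out.println(URLDecoder.decode(literal, "UTF-8").equals(literal));

        // A percent-encoded U+FFFC is decoded into the character itself.
        String decoded = URLDecoder.decode("https://example.com/roles%EF%BF%BC/", "UTF-8");
        System.out.println(hasObjectReplacement(decoded));

        // Ordinary String methods are enough to remove it if it is unwanted.
        System.out.println(stripObjectReplacement(decoded));
    }
}
```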

CodePudding user response:

"ef bf bc" is the 3-byte UTF-8 encoding of this character, so, as the error says, there is no representation for it in the CP1252 Windows code page.

One option is to replace that percent-encoded sequence with an ASCII representation, for example to build a filename for saving:

 String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%EF%BF%BC/".replace("%EF%BF%BC", "-xEFxBFxBC"), "UTF-8");
url ==> "https://www.breightgroup.com/job/hse-advisor-emb ... contract-roles-xEFxBFxBC/"

Another option is to use a CharsetDecoder:

CharsetDecoder decoder = Charset.forName("CP1252").newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE);
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles/", "UTF-8");
ByteBuffer buffer = ByteBuffer.wrap(urlDec.getBytes(Charset.forName("UTF-8")));
decoder.decode(buffer).toString();

Result

"https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles/"