I have come across a strange behavior of the JAXB marshaller when surrogate pairs are involved. Why does the marshaller add an unnecessary (and invalid) XML entity? When I try to marshal the following:
- \uD83D\uDCB3, i.e. the UTF-16 code units 55357 and 56499
The marshaller outputs code point 128179 (which is valid and represents the whole surrogate pair in XML) followed by an unnecessary 56499 (which is not a valid XML entity and is the low half of the pair). How can I configure the marshaller to produce valid XML entities in the output, or do I just need to upgrade libraries? I am using Java 8.
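For reference, here is a minimal sketch of how the numbers above relate to each other, using only `java.lang` calls and no JAXB at all. The class name `SurrogateDemo` and its two helper methods are hypothetical names chosen for illustration. It shows that the pair combines to 128179, and what `String.codePointAt` reports when a raw string is walked one `char` index at a time versus one code point at a time.

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogateDemo {

    // Walks the string one char index at a time, as a plain i++ loop does.
    static List<Integer> codePointsByCharIndex(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            // At the index of a low surrogate, codePointAt returns that surrogate itself.
            out.add(s.codePointAt(i));
        }
        return out;
    }

    // Walks the string one code point at a time, skipping both halves of a pair.
    static List<Integer> codePointsByCodePoint(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp); // 2 for a supplementary code point, else 1
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDCB3";
        // High and low surrogate combined into a single code point.
        System.out.println(Character.toCodePoint('\uD83D', '\uDCB3')); // 128179
        System.out.println(codePointsByCharIndex(s));  // [128179, 56499]
        System.out.println(codePointsByCodePoint(s));  // [128179]
    }
}
```

This only describes the raw input string. It may still be worth checking which of the two iteration styles is used when inspecting the marshalled output.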
Sample code to reproduce:
String inputSurrogate = "\uD83D\uDCB3";
JAXBContext jaxbContext = JAXBContext.newInstance(Customer.class);
Marshaller jaxbMarshaller = jaxbContext.createMarshaller();
jaxbMarshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
jaxbMarshaller.setProperty(Marshaller.JAXB_ENCODING, "UTF-8");
StringWriter sw = new StringWriter();
Customer customer = new Customer();
customer.setText(inputSurrogate);
jaxbMarshaller.marshal(customer, sw);
String xmlString = sw.toString();
System.out.println(xmlString);
for (int i = 0; i < xmlString.length(); i++) {
int ch = xmlString.codePointAt(i);
System.out.print(ch);
System.out.print("|");
}
Output (note the |128179|56499|; to my understanding, the 56499 is unnecessary and invalid):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<customer>
<text>