How to check whether a numeric encoded entity is a valid ISO8859-1 encoding?

Time:11-23

Let's say I was given a random numeric character reference like &#12345; (which renders as 〹). I need a way to check whether this is a valid encoding or not.

I think I can use the Charset library, but I can't quite work out how to build a solution from it.

CodePudding user response:

[This answer has been rewritten after further research.]

There's no simple answer to this using Charsets; see below for a complicated one.

There are simple answers using the character code, but it turns out to depend on exactly what you mean by ISO8859-1!

According to the Wikipedia page on ISO/IEC 8859-1, the character set ISO8859-1 defines only characters 32–126 and 160–255. So you could simply check for those ranges, e.g.:

fun Char.isISO8859_1() = code in 32..126 || code in 160..255

However, that same page also mentions the character set ISO-8859-1 (note the extra hyphen), which defines all 8-bit characters (0–255), assigning control characters to the extra ones. You could check for that with e.g.:

fun Char.isISO_8859_1() = code in 0..255

ISO8859-1 includes all the printable characters, so if you only want to know whether a character has a defined glyph, you could use the former. However, these days most people tend to mean ISO-8859-1: that's what many web pages use (those which haven't yet moved on to UTF-8), and that's what the first 256 Unicode characters are defined as. So the latter will probably be more generally useful.
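To make the difference between the two definitions concrete, here is a small sketch that lists the code points accepted by the full definition but not by the printable-only one. The helper names are made up for illustration, and they take an Int code point rather than a Char.

```kotlin
// Hypothetical helpers, taking an Int code point rather than a Char.
fun isPrintableLatin1(codePoint: Int) = codePoint in 32..126 || codePoint in 160..255
fun isFullLatin1(codePoint: Int) = codePoint in 0..255

// Code points accepted by the full definition but not the printable one:
// the C0 controls (0..31), DEL (127) and the C1 controls (128..159).
fun controlOnlyCodePoints(): List<Int> =
    (0..255).filter { isFullLatin1(it) && !isPrintableLatin1(it) }
```

The two checks disagree on exactly 65 values (32 in 0–31 plus 33 in 127–159), all of them control codes.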

Both of the above methods are of course very short, simple, and efficient; but they only work for the one character set; and it's awkward hard-coding details of a character set, when library classes already have that information.
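Applied to the original question, the input is a numeric character reference rather than a Char, so the number has to be extracted first. The following is only a sketch: the function names and the regex are my own assumptions, it handles just the decimal (&#233;) and hexadecimal (&#xE9;) forms, and it uses the full 0–255 definition.

```kotlin
// Matches decimal (&#233;) or hexadecimal (&#xE9;) numeric character references.
val NUMERIC_REFERENCE = Regex("""&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));""")

// Hypothetical helper: returns the referenced code point, or null if the
// text is not a well-formed numeric reference (or the number overflows Int).
fun parseNumericReference(reference: String): Int? {
    val match = NUMERIC_REFERENCE.matchEntire(reference) ?: return null
    // Unmatched groups come back as empty strings.
    val (hex, decimal) = match.destructured
    return if (hex.isNotEmpty()) hex.toIntOrNull(16) else decimal.toIntOrNull()
}

// Uses the full ISO-8859-1 definition (0..255); swap in the narrower
// ranges if only printable characters should count.
fun isLatin1Reference(reference: String): Boolean =
    parseNumericReference(reference)?.let { it in 0..255 } ?: false
```

With this, a reference such as &#12345; parses to code point 12345 and is rejected, while &#233; and &#xE9; both parse to 233 and are accepted.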

It seems that Charset objects are mainly aimed at encoding and decoding, so they don't provide a simple way to tell which characters are defined as such. But you can find out whether they can encode a given character. Here's the simplest way I found:

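import java.nio.CharBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset
import java.nio.charset.CodingErrorAction

// True if the charset can encode this character.
fun Char.isIn(charset: Charset) =
    try {
        charset.newEncoder()
               .onUnmappableCharacter(CodingErrorAction.REPORT)
               .encode(CharBuffer.wrap(toString()))
        true
    } catch (x: CharacterCodingException) {
        // Thrown when the character has no mapping in the charset.
        false
    }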

That's really inefficient, but it will work for any Charset that supports encoding.

If you try this for ISO_8859_1, you'll find that it can encode all 8-bit values, i.e. 0–255. So it's clearly using the full ISO-8859-1 definition.
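As a possible alternative to catching the exception, CharsetEncoder also has a canEncode method that answers the same question directly. Below is a minimal sketch built on it; the function name and the default charset argument are my own choices, and it takes an Int code point so that values above U+FFFF (which don't fit in a single Char) can be checked too.

```kotlin
import java.nio.charset.Charset
import java.nio.charset.StandardCharsets

// Hypothetical helper: true if the charset can encode the given Unicode code point.
fun canEncodeCodePoint(
    codePoint: Int,
    charset: Charset = StandardCharsets.ISO_8859_1
): Boolean {
    // Reject values outside the Unicode range before building a String.
    if (!Character.isValidCodePoint(codePoint)) return false
    // Decode-only charsets cannot create an encoder at all.
    if (!charset.canEncode()) return false
    // toChars yields one Char, or a surrogate pair above U+FFFF.
    val text = String(Character.toChars(codePoint))
    // A fresh encoder per call keeps this simple; encoders are not thread-safe.
    return charset.newEncoder().canEncode(text)
}
```

For example, canEncodeCodePoint(0xE9) should be true for the default ISO_8859_1, while canEncodeCodePoint(0x3039) should be false there but true with StandardCharsets.UTF_8.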
