How to parse string datetime & timezone with Arabic-Hindu digits in Java 8?

I wanted to parse string datetime & timezone with Arabic-Hindu digits, so I wrote a code like this:

    String dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠";
    char zeroDigit = '٠';
    Locale locale = Locale.forLanguageTag("ar");
    DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX")
            .withLocale(locale)
            .withDecimalStyle(DecimalStyle.of(locale).withZeroDigit(zeroDigit));
    ZonedDateTime parsedDateTime = ZonedDateTime.parse(dateTime, pattern);
    assert parsedDateTime != null;

But I received the exception:

java.time.format.DateTimeParseException: Text '٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠' could not be parsed at index 19

I checked a lot of questions on Stack Overflow, but I still don't understand what I did wrong.

It works fine with dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+02:00", where the timezone offset doesn't use Arabic-Hindu digits.

CodePudding user response：

Your dateTime string rests on a misunderstanding. It evidently tries to conform to the ISO 8601 format and fails, because ISO 8601 uses US-ASCII digits.

The classes of java.time (Instant, OffsetDateTime and ZonedDateTime) would parse your string without any formatter if only the digits were correct for ISO 8601. In the vast majority of cases I would take your avenue: try to parse the string as it is. Not in this case. To me it makes more sense to correct the string before parsing.

    String dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠";
    char[] dateTimeChars = dateTime.toCharArray();
    // Replace every Unicode decimal digit with its US-ASCII equivalent
    for (int index = 0; index < dateTimeChars.length; index++) {
        if (Character.isDigit(dateTimeChars[index])) {
            int digitValue = Character.getNumericValue(dateTimeChars[index]);
            dateTimeChars[index] = Character.forDigit(digitValue, 10);
        }
    }
    
    // CharBuffer (java.nio) wraps the array as a CharSequence without copying
    OffsetDateTime odt = OffsetDateTime.parse(CharBuffer.wrap(dateTimeChars));
    
    System.out.println(odt);

Output:

2021-11-08T02:21:08+02:00

Edit: It will be even better, of course, if you can educate the publisher of the string to use US-ASCII digits.

Edit: I know the Wikipedia article on ISO 8601 says:

Representations must be written in a combination of Arabic numerals and the specific computer characters (such as "-", ":", "T", "W", "Z") that are assigned specific meanings within the standard; …

This is one conceivable cause of the confusion. However, the article on Arabic numerals that it links to says:

Arabic numerals are the ten digits: 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9.

Edit: Thanks to @Holger for drawing my attention to CharBuffer in this context. A CharBuffer implements CharSequence, the type that the parse methods of java.time require, so it saves us from converting the char array back to a String.


CodePudding user response：

The error message states that the problem is at index 19 in the input string.

The character at index 19 is the + character in your input string. This means the offset (represented by XXX in your pattern) cannot be parsed.
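If you want to locate the offending character programmatically rather than by counting, DateTimeParseException exposes getErrorIndex() and getParsedString(). The following is only a minimal sketch: the class name ParseErrorLocator and the helper describeFailure are made up for illustration, and the demo input is a deliberately malformed ISO date, not the string from the question.

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class ParseErrorLocator {

    // Hypothetical helper: tries to parse the text and, on failure,
    // reports the error index together with the character found there.
    static String describeFailure(CharSequence text, DateTimeFormatter formatter) {
        try {
            formatter.parse(text);
            return "parsed OK";
        } catch (DateTimeParseException e) {
            int index = e.getErrorIndex();
            String parsed = e.getParsedString();
            // The index can equal the length when the input ends too early.
            String culprit = index < parsed.length()
                    ? "'" + parsed.charAt(index) + "'"
                    : "end of input";
            return "failed at index " + index + " (" + culprit + ")";
        }
    }

    public static void main(String[] args) {
        // ISO_LOCAL_DATE expects '-' after the year, so the '/' at index 4 is rejected.
        System.out.println(describeFailure("2021/11/08", DateTimeFormatter.ISO_LOCAL_DATE));
    }
}
```

Called with the formatter and the string from the question, the same helper should point at index 19, matching the exception message quoted there.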

The problem is not the + itself. The problem is that timezone offsets, like +05:00, are never localized.

The documentation doesn’t talk about this, so I had to go to the source code of DateTimeFormatterBuilder to verify it.

Inside that class is this inner class:

    static final class OffsetIdPrinterParser implements DateTimePrinterParser {

In that class, we can find a parse method which has calls to the private parseHour, parseMinute, and parseSeconds methods.

Each of those methods delegates to a private parseDigits method. In that method, we can see that only ASCII digits are considered:

    char ch1 = parseText.charAt(pos++);
    char ch2 = parseText.charAt(pos++);
    if (ch1 < '0' || ch1 > '9' || ch2 < '0' || ch2 > '9') {
        return false;
    }

So, the answer here is that the timezone offset must consist of ASCII digits, regardless of the locale.
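One way to act on this, if you would rather keep the formatter from the question, is to rewrite only the offset digits to ASCII and let the DecimalStyle handle the date and time fields. The following is a rough sketch under several assumptions: the class AsciiOffsetParser and its helpers are hypothetical names, the input is assumed to contain a 'T' followed by a signed numeric offset, and DecimalStyle.STANDARD is used in place of the locale-specific style from the question. The sample input is built from escapes so the source file stays ASCII-only.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DecimalStyle;

class AsciiOffsetParser {

    private static final char ARABIC_INDIC_ZERO = '\u0660';

    // Same pattern as in the question; only the date and time fields
    // are affected by the zero digit, the XXX offset is not.
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX")
                    .withDecimalStyle(DecimalStyle.STANDARD.withZeroDigit(ARABIC_INDIC_ZERO));

    // Hypothetical helper: converts the digits from the offset sign onward
    // to ASCII and leaves everything before the sign untouched.
    static String asciiOffset(String text) {
        int timeIndex = text.indexOf('T');
        int signIndex = Math.max(text.lastIndexOf('+'), text.lastIndexOf('-'));
        if (timeIndex < 0 || signIndex < timeIndex) {
            return text; // no signed offset after the time part, e.g. "Z"
        }
        StringBuilder result = new StringBuilder(text.substring(0, signIndex));
        for (int i = signIndex; i < text.length(); i++) {
            char ch = text.charAt(i);
            int value = Character.digit(ch, 10); // -1 for '+', '-' and ':'
            result.append(value >= 0 ? Character.forDigit(value, 10) : ch);
        }
        return result.toString();
    }

    static OffsetDateTime parse(String text) {
        return OffsetDateTime.parse(asciiOffset(text), FORMATTER);
    }

    // Demo helper: turns ASCII digits into Arabic-Indic digits to build sample input.
    static String toArabicIndic(String ascii) {
        StringBuilder result = new StringBuilder(ascii.length());
        for (char ch : ascii.toCharArray()) {
            result.append(ch >= '0' && ch <= '9' ? (char) (ARABIC_INDIC_ZERO + (ch - '0')) : ch);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String input = toArabicIndic("2021-11-08T02:21:08+02:00");
        System.out.println(parse(input)); // prints 2021-11-08T02:21:08+02:00
    }
}
```

Compared with converting every digit up front, this keeps the localized date and time parsing in the formatter, at the cost of some string handling that depends on the exact input layout.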