Parsing PDF date using Java DateTimeFormatter

Time:10-21

I'm trying to parse the date format used in PDFs. According to this page, the format looks as follows:

D:YYYYMMDDHHmmSSOHH'mm'

Where all components except the year are optional. I assume this means the string can be cut off at any point, since specifying, e.g., a year and an hour without a month and a day seems kind of pointless to me. Also, it would make parsing pretty much impossible.

As far as I can tell, Java does not support zone offsets containing single quotes. Therefore, the first step would be to get rid of those:

D:YYYYMMDDHHmmSSOHHmm

The resulting Java date pattern should then look like this:

['D:']uuuu[MM[dd[HH[mm[ss[X]]]]]]

And my overall code looks like this:

DateTimeFormatter formatter = DateTimeFormatter.ofPattern("['D:']uuuu[MM[dd[HH[mm[ss[X]]]]]]");
TemporalAccessor temporalAccessor = formatter.parseBest("D:20020101",
    ZonedDateTime::from,
    LocalDateTime::from,
    LocalDate::from,
    Month::from,
    Year::from
);

I would expect that to result in a LocalDate object, but what I get instead is java.time.format.DateTimeParseException: Text 'D:20020101' could not be parsed at index 2.

I've played around a bit with that and found out that everything works fine with the optional literal at the beginning but as soon as I add optional date components, I get an exception.

Can anybody tell me what I'm doing wrong?

Thanks in advance!

CodePudding user response:

I've found a solution:

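import java.time.*;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

String dateString = "D:20020101120000+01'00'";
// Java cannot parse the quoted offset, so strip the single quotes first
String normalized = dateString.replace("'", "");
// "pp" / "pppp" pad each numeric field to a fixed width of 2 / 4 characters
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("['D:']ppppy[ppM[ppd[ppH[ppm[pps[X]]]]]]");
// parseBest returns the first (most specific) type the parsed fields can build
TemporalAccessor temporalAccessor = formatter.parseBest(normalized,
    OffsetDateTime::from,
    LocalDateTime::from,
    LocalDate::from,
    YearMonth::from,
    Year::from
);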

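It seems the length of each component is ambiguous when there are no separators: a pattern such as uuuu only sets a minimum width for the year, so the parser cannot tell where the year ends and the month begins, and parsing fails. When a padding is specified (the p letters), the width of each component is fixed and the date can therefore be parsed.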

At least that's my theory.
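
If that theory holds, the same effect should be achievable without pad letters by giving every field an explicit fixed width through DateTimeFormatterBuilder. The following is only a minimal sketch under that assumption: the class name PdfDateParser and its parse method are made up for illustration, and the offset is assumed to be either "Z" or a sign followed by hours and optional minutes once the quotes are stripped.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.Year;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.time.temporal.TemporalAccessor;

// Hypothetical helper, not part of any library.
final class PdfDateParser {

    // Every numeric field has a fixed width, so no separators are needed.
    // toFormatter() closes the optional sections that are still open.
    private static final DateTimeFormatter PDF_DATE = new DateTimeFormatterBuilder()
            .optionalStart().appendLiteral("D:").optionalEnd()
            .appendValue(ChronoField.YEAR, 4)
            .optionalStart().appendValue(ChronoField.MONTH_OF_YEAR, 2)
            .optionalStart().appendValue(ChronoField.DAY_OF_MONTH, 2)
            .optionalStart().appendValue(ChronoField.HOUR_OF_DAY, 2)
            .optionalStart().appendValue(ChronoField.MINUTE_OF_HOUR, 2)
            .optionalStart().appendValue(ChronoField.SECOND_OF_MINUTE, 2)
            .optionalStart().appendOffset("+HHmm", "Z")
            .toFormatter();

    // Strips the single quotes from the offset, then returns the most
    // specific type that the parsed fields can build.
    static TemporalAccessor parse(String pdfDate) {
        String normalized = pdfDate.replace("'", "");
        return PDF_DATE.parseBest(normalized,
                OffsetDateTime::from,
                LocalDateTime::from,
                LocalDate::from,
                YearMonth::from,
                Year::from);
    }

    public static void main(String[] args) {
        System.out.println(parse("D:20020101"));
        System.out.println(parse("D:20020101120000+01'00'"));
    }
}
```

With this sketch, "D:20020101" should come back as a LocalDate and the full string with an offset as an OffsetDateTime. Whether the builder or the padded pattern is preferable is mostly a matter of taste; the builder is longer but states each width explicitly.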
