Resolving invalid data in CSV file with Apache Commons

Time:09-22

While parsing CSV data with the Apache Commons CSV library, I get the following error:

java.lang.IllegalStateException: IOException reading next record: java.io.IOException: 
(line 46196) invalid char between encapsulated token and delimiter

I am using the following setup:

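import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

// getLatestFilefromDir, CSV_PATH, HEADERS and log are defined elsewhere in the class
try {
    File csvInput = getLatestFilefromDir(CSV_PATH);

    final CSVFormat csvFormat = CSVFormat.Builder.create()
            .setHeader(HEADERS)
            .setDelimiter(';')
            .setQuote('"')
            .setEscape('\\')
            .setSkipHeaderRecord(true)
            .build();

    // try-with-resources closes the reader once iteration finishes or fails
    try (Reader reader = new FileReader(csvInput)) {
        Iterable<CSVRecord> csvRecords = csvFormat.parse(reader);

        for (CSVRecord csvRecord : csvRecords) {
            // processing
        }
    }
} catch (Exception e) {
    // pass the exception to the logger so the stack trace ends up in the log
    log.error("Error retrieving CSV data.", e);
}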

As the error suggests, the data contains a malformed entry (the last line below):

"TABLE_NAME";"ATTRIBUTE";"VALUE"
"SWAP_LEG_TYPE";"SWAP_LEG_TYPE_DESC";"The payments (PAY or RECEIVE) of this \"Leg\" are based on the yield linked to a specific equity or an index. (or to the actual market price of the equity or the index ???)"
"CNTPTY_TYPE";"CNTPTY_TYPE_DESC";"With Local Government we mean the so called \Regional Governments or Local Authorities\\" (RGLA) as defined by the EBA (European Banking Authority).\""

Changing the data is out of my control. Assuming the backslash is meant to escape quotes, as in the first data row, it was used incorrectly in this row and ended up in the CSV file as is. Presumably the text should have been:

...Authorities\ \" (RGLA)...

Is there a way to replace a string before parsing? Or how can I extend the CSVFormat builder to accept such data?

I am thinking of a simple method that reads the whole input and replaces \\ with \, since this is the only such instance in a million lines, but that seems wrong.

CodePudding user response:

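This is a slightly modified version of your original setup that should solve your issue; setQuote(null) does all the magic. It disables quote handling, so the parser no longer fails on the stray quote. Keep in mind that the quote characters then stay in the parsed values, and a delimiter inside a field would split that field.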

    final CSVFormat csvFormat = CSVFormat.Builder.create()
            .setHeader(HEADERS)
            .setDelimiter(';')
            .setQuote(null)
            .setEscape('\\')
            .setSkipHeaderRecord(true)
            .build();
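
If quote handling has to stay enabled, the pre-processing idea from the question could look roughly like the sketch below. It is only an illustration under assumptions: CsvSanitizer, repairLine and repair are hypothetical names and not part of Apache Commons CSV, and the rule "an escaped backslash followed by a quote that is not followed by the delimiter or the end of the line" is a guess at what the malformed data looks like. The sketch also buffers the whole input in memory, which may or may not be acceptable for a file with a million lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper, not part of Apache Commons CSV. */
final class CsvSanitizer {

    // The malformed sequence: an escaped backslash directly followed by a quote (\\").
    private static final String BROKEN = "\\\\\"";
    // What the question assumes was intended: an escaped quote (\").
    private static final String FIXED = "\\\"";

    private CsvSanitizer() {}

    /** Rewrites \\" to \" wherever the quote cannot be a closing quote. */
    static String repairLine(String line, char delimiter) {
        StringBuilder out = new StringBuilder(line.length());
        int i = 0;
        while (i < line.length()) {
            int next = i + BROKEN.length();
            // A closing quote is followed by the delimiter or the end of the line;
            // anything else after \\" is treated as the defect (assumption).
            boolean broken = line.startsWith(BROKEN, i)
                    && next < line.length()
                    && line.charAt(next) != delimiter;
            if (broken) {
                out.append(FIXED);
                i = next;
            } else {
                out.append(line.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    /** Reads all lines, repairs each one, and returns a Reader over the result. */
    static Reader repair(Reader input, char delimiter) {
        StringBuilder repaired = new StringBuilder();
        try (BufferedReader lines = new BufferedReader(input)) {
            String line;
            while ((line = lines.readLine()) != null) {
                repaired.append(repairLine(line, delimiter)).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new StringReader(repaired.toString());
    }
}
```

With this in place, the original format with setQuote('"') could be kept and the parse call changed to csvFormat.parse(CsvSanitizer.repair(reader, ';')). A field that legitimately ends with an escaped backslash right before its closing quote is left untouched, because its quote is followed by the delimiter or the end of the line.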