Parse huge CSV file


I have a huge CSV file, and I need to:

  1. read it
  2. validate it
  3. write it to a database

After some research, I found this solution:

//configure the input format
CsvParserSettings settings = new CsvParserSettings();

//get an iterator over the rows (univocity-parsers)
CsvParser parser = new CsvParser(settings);
Iterator<String[]> it = parser.iterate(new File("/path/to/your.csv"), "UTF-8").iterator();

//connect to the database and create an insert statement
Connection connection = getYourDatabaseConnectionSomehow();
final int COLUMN_COUNT = 2;
List<String[]> badDataList = new ArrayList<>();
PreparedStatement statement = connection.prepareStatement("INSERT INTO some_table(column1, column2) VALUES (?,?)");

//run batch inserts of 1000 rows per batch
int batchSize = 0;
while (it.hasNext()) {
    //get next row from parser and set values in your statement
    String[] row = it.next(); 
    //validate the row; collect bad rows instead of inserting them
    if (!row[0].matches(someRegex)) {
        badDataList.add(row);
        continue;
    }
    for (int i = 0; i < COLUMN_COUNT; i++) {
        if (i < row.length) {
            statement.setObject(i + 1, row[i]);
        } else { //row in input is shorter than COLUMN_COUNT
            statement.setObject(i + 1, null);
        }
    }

    //add the values to the batch
    statement.addBatch();
    batchSize++;

    //once 1000 rows made into the batch, execute it
    if (batchSize == 1000) {
        statement.executeBatch();
        batchSize = 0;
    }
}
// the last batch probably won't have 1000 rows.
if (batchSize > 0) {
    statement.executeBatch();
}
Or use jOOQ's loadArrays:

context.loadInto(table("book"))
       .batchAfter(500)
       .loadArrays(rows);  // rows: the parsed List<String[]>

However, it is still too slow because everything runs in the same thread. Is there any way to make it faster with multi-threading?

CodePudding user response:

Instead of iterating over the records one by one, use a bulk-import command such as LOAD DATA INFILE:

JDBC: CSV raw data export/import from/to remote MySQL database using streams (SELECT INTO OUTFILE / LOAD DATA INFILE)

Note: as @XtremeBaumer said, each database vendor has its own command for bulk importing from files.
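
For MySQL, a minimal sketch might look like this (the connection URL, table, and column names are only illustrative; LOCAL INFILE must be allowed on both the client and the server):

//Sketch: hand the whole file to the server in a single statement (MySQL).
try (Connection connection = DriverManager.getConnection(
         "jdbc:mysql://localhost:3306/mydb?allowLoadLocalInfile=true", "user", "password");
     Statement loadStatement = connection.createStatement()) {

    loadStatement.execute(
        "LOAD DATA LOCAL INFILE '/path/to/your.csv' " +
        "INTO TABLE some_table " +
        "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' " +
        "LINES TERMINATED BY '\\n' " +
        "IGNORE 1 LINES " +          //skip a header row, if the file has one
        "(column1, column2)");
}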

Validation can be done with different strategies. For example, if the validation can be expressed in SQL, you can import the data into a temporary staging table and then select only the valid rows into the target table.
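
A rough sketch of that idea (the staging table name and the REGEXP condition are placeholders, MySQL syntax again):

//Sketch: bulk-load everything into a staging table, then copy only the valid rows over.
try (Statement sql = connection.createStatement()) {
    sql.execute("LOAD DATA LOCAL INFILE '/path/to/your.csv' " +
                "INTO TABLE some_table_staging (column1, column2)");
    sql.execute("INSERT INTO some_table (column1, column2) " +
                "SELECT column1, column2 FROM some_table_staging " +
                "WHERE column1 REGEXP '^[0-9]+$'");   //keep only rows that pass validation
}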

Or you can validate the data in Java first and then bulk-import only the validated rows, instead of inserting them one by one.
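
For example (only a sketch, assuming univocity's CsvWriter for writing the filtered file; the regex is a placeholder):

//Sketch: filter rows in Java, write the survivors to a temporary CSV, then bulk-load that file.
File validated = File.createTempFile("validated", ".csv");
CsvWriter writer = new CsvWriter(validated, "UTF-8", new CsvWriterSettings());
for (String[] row : parser.iterate(new File("/path/to/your.csv"), "UTF-8")) {
    if (row[0].matches(someRegex)) {
        writer.writeRow(row);
    }
}
writer.close();
//now LOAD DATA LOCAL INFILE the 'validated' file as shown above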

CodePudding user response:

First, you should close the statement and the connection; use try-with-resources. Then check the (auto)commit / transaction handling:

connection.setAutoCommit(true);

In the same category: a database lock on the table could slow things down, should the database be in use by others.
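
A minimal sketch of that structure, reusing the statement from the question (whether you commit per batch or keep auto-commit on is exactly the thing to check):

//Sketch: try-with-resources closes the statement and the connection, even if a batch fails.
try (Connection connection = getYourDatabaseConnectionSomehow();
     PreparedStatement statement = connection.prepareStatement(
             "INSERT INTO some_table(column1, column2) VALUES (?,?)")) {

    connection.setAutoCommit(false);   //commit explicitly per batch...
    //... the parsing and addBatch/executeBatch loop from the question ...
    connection.commit();               //...or keep setAutoCommit(true) and skip this
}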

String.matches(...) recompiles the regex on every call, which is slow. Instead of:

if (!row[0].matches(someRegex)) {

do

private static final Pattern VALID_ROW_PATTERN = Pattern.compile(someRegex);
...
if (!VALID_ROW_PATTERN.matcher(row[0]).matches()) { badDataList.add(row); continue; }

If there is a running number such as an integer ID, the batch might be faster if you keep it as a long (statement.setLong(...)). If a column has a small, finite set of values, you could map each value to one canonical String instance instead of creating 1000 different String objects per batch. I am not sure whether these two measures help.
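
For illustration (a sketch only; whether it pays off depends on your data):

//Sketch: bind the numeric ID as a long and reuse one String instance per distinct value.
private static final Map<String, String> CANONICAL = new HashMap<>();

private static String canonical(String value) {
    return CANONICAL.computeIfAbsent(value, v -> v);   //returns the first-seen instance
}
...
statement.setLong(1, Long.parseLong(row[0]));   //running integer ID
statement.setString(2, canonical(row[1]));      //column with a small, finite set of values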

Multithreading seems dubious here and should be the last resort. You could have one thread parse the CSV and write rows to a queue, while a second thread consumes the queue and writes to the database.
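
A rough producer/consumer sketch of that idea (the queue size and the poison-pill end marker are only illustrative choices):

//Sketch: one thread parses, one thread inserts; a bounded queue connects them.
BlockingQueue<String[]> queue = new ArrayBlockingQueue<>(10_000);
final String[] POISON_PILL = new String[0];   //marks the end of the input

Thread producer = new Thread(() -> {
    try {
        for (String[] row : parser.iterate(new File("/path/to/your.csv"), "UTF-8")) {
            queue.put(row);
        }
        queue.put(POISON_PILL);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});

Thread consumer = new Thread(() -> {
    try {
        String[] row;
        while ((row = queue.take()) != POISON_PILL) {
            //set parameters, addBatch(), and executeBatch() every 1000 rows as before
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});

producer.start();
consumer.start();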
