I have Java code for inserting into Redshift like this:
String query = "INSERT INTO table (id, name, value) VALUES (?, ?, ?)";
PreparedStatement ps = connection.prepareStatement(query);
for (Record record : records) {
    ps.setInt(1, record.id);
    ps.setString(2, record.name);
    ps.setInt(3, record.value);
    ps.addBatch();
}
ps.executeBatch();
records contains a few thousand items. I tried running this code against Postgres: it inserted all of them almost instantly, while with Redshift it takes 10 minutes. After that I rewrote it to the following code:
String query = "INSERT INTO table (id, name, value) VALUES ";
for (Record record : records) {
    query += "(" + record.id + ",'" + record.name + "'," + record.value + "),";
}
query = query.substring(0, query.length() - 1); // drop the trailing comma
PreparedStatement ps = connection.prepareStatement(query);
ps.executeUpdate();
And this fixed the performance. This code works fine for both Redshift and Postgres. My question is: what's wrong with the first code snippet, and how can I fix it? (I assume the first snippet simply ignores batching on Redshift.)
CodePudding user response:
Inserting single rows multiple times is never a good plan on a columnar database. Postgres is row-based and Redshift is column-based.
Each INSERT on Postgres just makes another row, but on Redshift each INSERT requires that the column data be read, one element added, and the column written back. Redshift doesn't rewrite the entire column, only the last 1MB block, but it still has to read that block for each INSERT.
Also, since Redshift is clustered and the data for your table is distributed around the cluster, each subsequent INSERT accesses a different block on a different slice of the table. These single-slice INSERTs are serialized by the single-threaded nature of your code, so each one has to complete before the next can be issued.
Your second code packs many rows of data into a single INSERT statement, which is compiled once; the data is sent to all slices of the database, and each slice stores only its own rows and discards the rest. This uses the parallelism of Redshift and only has to open the 1MB block on each slice once. However, there are still performance and scalability issues with this approach (shared with approach #1, though not as bad). All the data passes through the query compiler and then on to every slice, which can slow compile time and waste network bandwidth. All the data also flows through the leader node, which is responsible for many database functions, and pushing large amounts of data through it can cause significant cluster-wide performance issues. The amount of data you can insert this way is limited by the maximum query length (16MB of statement text). There is more, but I'll stop there. While this approach is better, from Redshift's point of view it is still far from ideal.
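A minimal sketch of that chunked, multi-row approach might look like this (assuming the Record class from the question; the 500-row chunk size is an arbitrary choice meant to stay well under the 16MB statement limit):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Sketch only: build one multi-row INSERT per chunk instead of one INSERT per row.
void insertInChunks(Connection connection, List<Record> records) throws SQLException {
    int chunkSize = 500; // arbitrary; keeps each statement well under the 16MB query limit
    for (int start = 0; start < records.size(); start += chunkSize) {
        List<Record> chunk = records.subList(start, Math.min(start + chunkSize, records.size()));
        StringBuilder sql = new StringBuilder("INSERT INTO table (id, name, value) VALUES ");
        for (int i = 0; i < chunk.size(); i++) {
            Record r = chunk.get(i);
            if (i > 0) sql.append(",");
            // name is interpolated directly, as in the question, so it must be trusted or escaped
            sql.append("(").append(r.id).append(",'").append(r.name).append("',").append(r.value).append(")");
        }
        try (Statement st = connection.createStatement()) {
            st.executeUpdate(sql.toString());
        }
    }
}

Each chunk still goes through the leader node, so this only reduces per-statement overhead; it does not remove the bandwidth and compile-time costs described above.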
Bottom line: Postgres is a single-instance (scale-up), row-based, OLTP database designed for single-row inserts, and Redshift is a clustered (scale-out), column-based, OLAP database designed for parallel bulk inserts. The COPY command causes each compute node in Redshift to connect to S3 and read component files of the input data. This allows for parallel actions by the Redshift cluster, independent network access to S3, and parallel processing of the read data. (If you really want your code to run fast, make it multi-threaded, write your S3 files in parallel, and then issue a COPY command to Redshift.)
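As a rough sketch of that COPY route, assuming the data files have already been written to S3 (the bucket, prefix, and IAM role below are placeholders):

// Sketch only: the files under the S3 prefix are assumed to already exist,
// e.g. gzipped CSV parts written in parallel by your own code.
String copySql =
    "COPY table FROM 's3://my-bucket/loads/records/' " +
    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' " +
    "FORMAT AS CSV GZIP";
try (Statement st = connection.createStatement()) {
    st.execute(copySql);
}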
CodePudding user response:
If you want better performance from JDBC batch inserts into Redshift, you can follow the Java code below.
public void testInsert(List<TestObject> testLst) {
    // Build multi-row INSERT statements of 200 rows each and add them to a JDBC batch.
    StringBuilder my_query = new StringBuilder("INSERT INTO table VALUES ");
    try (Connection my_connect = config_to.getConnection()) {
        try (Statement my_statement = my_connect.createStatement()) {
            for (int n = 0; n < testLst.size(); n++) {
                if (n == 0) {
                    my_query.append(getSqlForm(testLst.get(n)));
                } else if (n % 200 != 0) {
                    my_query.append(",").append(getSqlForm(testLst.get(n)));
                } else {
                    // every 200 rows, close the current statement and start a new one
                    my_statement.addBatch(my_query.toString());
                    my_query = new StringBuilder("INSERT INTO table VALUES ");
                    my_query.append(getSqlForm(testLst.get(n)));
                }
                if (n > 0 && n % 1000 == 0) {
                    // every 1000 rows, send the accumulated statements to the server
                    my_statement.executeBatch();
                }
            }
            my_statement.addBatch(my_query.toString());
            my_statement.executeBatch();
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

private static String getSqlForm(TestObject obj) {
    String result = "('%s','%s', %s)";
    result = String.format(result, obj.getGuid(), obj.getName(), obj.getId());
    return result;
}
In Java, managing the insertion of multiple rows in a single statement takes some manual programming. Using Statement, you can build up a sizable SQL string and insert your data quickly. The concern is that this uses a Statement rather than a PreparedStatement, which leaves the program vulnerable to SQL injection and somewhat lowers performance, so this strategy only works for non-interactive programs whose input you control.
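If the SQL injection concern matters, one possible variant (a sketch, not taken from the answer above) is to generate the multi-row VALUES clause with ? placeholders and bind the values through a PreparedStatement, so the driver handles quoting; the column names guid, name, and id are assumptions, since the INSERT above does not list its columns:

// Sketch only: multi-row INSERT with ? placeholders; column names are assumed.
void insertChunk(Connection conn, List<TestObject> chunk) throws SQLException {
    StringBuilder sql = new StringBuilder("INSERT INTO table (guid, name, id) VALUES ");
    for (int i = 0; i < chunk.size(); i++) {
        sql.append(i == 0 ? "(?,?,?)" : ",(?,?,?)");
    }
    try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
        int p = 1;
        for (TestObject obj : chunk) {
            ps.setString(p++, obj.getGuid());
            ps.setString(p++, obj.getName());
            ps.setObject(p++, obj.getId());
        }
        ps.executeUpdate();
    }
}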