Will queries to the same database likely be faster if run in parallel?


I have a series of queries which are currently not parallelized; the Java code looks something like this:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.List;

import com.google.common.collect.Lists;

public class Cache {

    private static final String SQL_A = "select * from tableA";
    // SQL_B, SQL_C ...

    private List<EntityA> cacheA = Lists.newArrayList();
    private List<EntityB> cacheB = Lists.newArrayList();
    private List<EntityC> cacheC = Lists.newArrayList();

    // ...

    public void updateCache() {
        updateCacheA(this.connection);
        updateCacheB(this.connection);
        updateCacheC(this.connection);
        // ...
    }

    private void updateCacheA(Connection conn) {
        try (
                PreparedStatement sm = conn.prepareStatement(
                        SQL_A, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
                ResultSet rs = sm.executeQuery();
        ) {
            // materialize the full result set, then swap the cache reference
            List<EntityA> entities = Lists.newArrayList();
            while (rs.next()) {
                EntityA entity = new EntityA(rs);
                entities.add(entity);
            }
            this.cacheA = entities;
        } catch (Exception e) {
            // exception handling
        }
    }

    // updateCacheB, updateCacheC
}

All the SQL statements are fairly simple queries against a single table, with at most a WHERE condition, so I don't think the bottleneck would be the computing power of the database. Would running the various updates in parallel likely be faster? Could it even get worse in some cases? If so, when and why?

I'd like to know how to analyse the problem (aside from benchmarks) more than get a straight answer (as there probably isn't one), since the computing power, I/O capabilities, etc. of both the backend server and the database could differ between production environments.

The database runs on Postgres.

CodePudding user response:

It is generally faster to run queries in parallel rather than sequentially, especially if the queries are independent of each other and the database can handle the increased load.

In your code, you can achieve this by using multiple threads to run the updateCacheA, updateCacheB, and updateCacheC methods concurrently. This can be done using the Executor framework or using Thread objects directly.
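As an illustration, here is a minimal sketch using the Executor framework, assuming it lives inside the question's Cache class. Each task borrows its own Connection, since a single JDBC Connection must not be shared across threads; getConnection() here is a hypothetical helper that hands out a connection, e.g. from a pool.

import java.sql.Connection;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public void updateCacheParallel() throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(3);
    try {
        List<Callable<Void>> tasks = List.of(
                // getConnection() is a hypothetical pool lookup: each task
                // needs its own Connection, not a shared one
                () -> { try (Connection c = getConnection()) { updateCacheA(c); } return null; },
                () -> { try (Connection c = getConnection()) { updateCacheB(c); } return null; },
                () -> { try (Connection c = getConnection()) { updateCacheC(c); } return null; });
        // invokeAll blocks until every task has completed
        for (Future<Void> f : pool.invokeAll(tasks)) {
            f.get(); // surfaces any exception thrown inside a task
        }
    } finally {
        pool.shutdown();
    }
}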

However, you should be aware that running queries in parallel can also increase the load on the database and may not always result in faster performance. It is a good idea to benchmark and test the performance of your application with and without parallel queries to determine the optimal approach for your use case.

CodePudding user response:

A concrete answer will depend on what database vendor you are using as well as the server geography and network. Before focusing on running multiple connections (and the resultant complexity), make sure you are currently accessing the database as efficiently as possible with one connection.

A good starting point would be the following:

Make sure you do not have to open a new connection every time.

In your code, it appears that you are already managing this elsewhere. Opening and closing a connection each time will cost you performance, especially if you need multiple connections in order to run your query piecewise. Assuming you are using a connection pool, this should be working fine.
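For illustration, a pooled DataSource can be configured once and reused; this sketch uses HikariCP purely as an example, and the JDBC URL and credentials are placeholders.

import java.sql.Connection;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb"); // placeholder URL
config.setUsername("user");     // placeholder
config.setPassword("secret");   // placeholder
config.setMaximumPoolSize(4);   // enough for the parallel cache updates
HikariDataSource dataSource = new HikariDataSource(config);

// close() returns the connection to the pool instead of closing it
try (Connection conn = dataSource.getConnection()) {
    // run queries
}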

Hold on to the PreparedStatement object between calls.

In your code, the PreparedStatement is automatically closed once execution exits the Try/Catch block. You are missing out on one of the benefits of having a PreparedStatement: keep it ready, not needing to be re-parsed, and set a new bind variable each time you use it.
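A minimal sketch of what holding on to the statement could look like inside the question's Cache class. The query with a ? bind variable is hypothetical (the question's SQL_A has none), and this only works while you keep hold of the same Connection, since a PreparedStatement is tied to the connection that created it.

// Keep the PreparedStatement as a field instead of re-creating it on
// every call; only the bind variable changes between executions.
private PreparedStatement findByIdStmt; // hypothetical parameterized query

private List<EntityA> findById(Connection conn, long id) throws SQLException {
    if (findByIdStmt == null) {
        findByIdStmt = conn.prepareStatement("select * from tableA where id = ?");
    }
    findByIdStmt.setLong(1, id); // rebind only, no re-parse
    try (ResultSet rs = findByIdStmt.executeQuery()) {
        List<EntityA> entities = Lists.newArrayList();
        while (rs.next()) {
            entities.add(new EntityA(rs));
        }
        return entities;
    }
}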

Make sure you have a large enough fetch size.

If your query returns 3 rows with each call, this is not relevant; however, if your query can potentially return thousands of rows, you should set an appropriate fetch size. The default depends on the driver: Oracle's driver fetches 10 rows per round trip (so pulling 10,000 rows costs 1,000 network round trips), while PostgreSQL's driver fetches the entire result set at once unless you set a fetch size and disable autocommit, which can strain client memory instead. Setting it to something like 500 using setFetchSize on the statement is a reasonable middle ground.
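For illustration, a sketch of cursor-based fetching with the PostgreSQL driver; the value 500 is arbitrary.

conn.setAutoCommit(false); // required for cursor-based fetching in the PostgreSQL driver
try (PreparedStatement sm = conn.prepareStatement(SQL_A)) {
    sm.setFetchSize(500); // fetch 500 rows per network round trip
    try (ResultSet rs = sm.executeQuery()) {
        while (rs.next()) {
            // process row
        }
    }
}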

Have the DBA assist in providing metrics for how long your query takes to execute on the database server.

This can help you to find out where your time is going. If the query takes 10 seconds to execute, you may or may not save time with parallel calls, but you probably should invest some effort in speeding that query up. If it takes 10 milliseconds, then you know to focus solely on things outside of the database.
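To complement the server-side metrics, a rough client-side split between execute time and fetch time can be measured like this (a sketch; the timings include driver and network overhead, so they won't match the DBA's numbers exactly):

long t0 = System.nanoTime();
try (PreparedStatement sm = conn.prepareStatement(SQL_A);
     ResultSet rs = sm.executeQuery()) {
    long t1 = System.nanoTime(); // query executed, first batch received
    int rows = 0;
    while (rs.next()) {
        rows++;
    }
    long t2 = System.nanoTime(); // all rows pulled over the wire
    System.out.printf("execute: %d ms, fetch %d rows: %d ms%n",
            (t1 - t0) / 1_000_000, rows, (t2 - t1) / 1_000_000);
}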

Consider network captures for a deeper understanding of where your time is going.

If the DBA analysis doesn't give you everything you need, you can look at the network traffic directly: install Wireshark on your machine and run a capture while you exercise the application. This will give you quite a bit of useful information about where the time is going. Look at the TCP delays both going to the DB and coming back.
