Performance issue when using multiple threads with sqlite3-CodePudding

I am writing a program that generates hashes for files in all subdirectories and then puts them in a database or prints them to standard output: https://github.com/cherrry9/dedup

In the latest commit, I added option for my program to use multiple threads (THREADS macro).

Here are some benchmarks that I did:

$ test() { /usr/bin/time -p ./dedup / -v 0 -c 2048 -e "/\(proc\|sys\|dev\|run\)"; }
$ make clean all THREADS=1 test
real 8.03
user 4.34
sys 4.55
$ make clean all THREADS=4 && test
real 3.94
user 7.66
sys 7.42

As you can the version compiled with THREADS=4 was 2 times faster.

Now I will use the second positional argument to specify sqlite3 database:

$ test() { /usr/bin/time -p ./dedup / test.db -v 0 -c 2048 -e "/\(proc\|sys\|dev\|run\)"; }
$ make clean all THREADS=1 && test
real 20.40
user 7.58
sys 7.29
$ rm test.db
$ make clean all THREADS=4 && test
real 21.86
user 17.17
sys 18.15

Version compiled with THREADS=4 was slower than version that used THREADS=1!

When I used second argument, in dedup.c was executed this code that inserted hashes to database:

if (sql != NULL && sql_insert(sql, entry->fpath, hash) != 0) {
// ...

sql_insert uses transactions to prevent sqlite from writing to database every time I call INSERT.

int
sql_insert(SQL *sql, const char *filename, char unsigned hash[])
{
    int errcode;

    pthread_mutex_lock(&sql->mtx);
    sqlite3_bind_text(sql->stmt, 1, filename, -1, NULL);
    sqlite3_bind_blob(sql->stmt, 2, hash, SHA256_LENGTH, NULL);

    sqlite3_step(sql->stmt);
    SQL_TRY(sqlite3_reset(sql->stmt));

    if (  sql->insertc >= INSERT_LIM) {
        SQL_TRY(sqlite3_exec(sql->database, "COMMIT;BEGIN", NULL, NULL, NULL));
        sql->insertc = 0;
    }

    pthread_mutex_unlock(&sql->mtx);
    return 0;
}

This fragment is executed for every processed file and for some reason it's blocking all threads in my program.

And here's my question, how can i prevent sqlite from blocking threads and degrading the performance of my program?

Here is dedup options explanation if you wonder what test function is doing:

1th positional argument - directory to use to generate hashes
2th positional argument - path to databases which will be used by sqlite3
-v level  - verbose level (0 means print only errors)
-c nbytes - read nbytes from each file
-e regex  - exclude directories that match regex

I'm using serialized mode in sqlite3.

CodePudding user response：

It seems that all your threads use the same database connection and statement objects. Therefore you have a race-condition (even in SERIALIZED threading model), as multiple threads are binding, stepping, and resetting the same statement. Asking 'why is it slow' becomes irrelevant until you fix this problem.

Instead you should wrap your sql_insert with a mutex to guarantee that at most one thread is accessing the database connection:

int
sql_insert(SQL *sql, const char *filename, char unsigned hash[])
{
    pthread_mutex_lock(&sql->mutex);
    // ... actual insert and exec code ...
    pthread_mutex_unlock(&sql->mutex);
    return 0;
}

Then add and initialize that mutex in your SQL structure with pthread_mutex_init.

You'll see the performance boost if your bottleneck is indeed the computation of SHA-256 rather than writing into the database. Otherwise the overhead of this mutex should be negligible and the number of threads will not have a significant effect of the run-time.