I've been running into huge performance slowdowns when looping over strings in a large Python list and passing each one into a SQLite SELECT statement to return a value. Each lookup takes 0.5s to 0.7s, which works out to roughly 2 hours for 10K strings.
After extensive searching, I'm unsure how to "batch" the list into fewer consolidated queries in the hope of gaining speed.
I have this generated list of file names from a function:
documents = ['value1.xlsx', 'value2.docx', 'value3.txt', 'value4.csv', ...]
I also have a large 1GB database full of unique File Names and File Hashes.
def iterate_documents(cur, documents):
    i = 0
    output = []  # Declare empty list
    while i < len(documents):
        # Return the MD5 hash of the passed file name
        match = cur.execute('SELECT md5hash FROM hash_table WHERE name=?', (documents[i],)).fetchone()
        # If a file name hasn't been seen by the DB, say the hash is "Unknown"
        if match is None or len(match) == 0:
            match = "Unknown"
        else:
            match = match[0]
        output.append(match)  # Append the match or 'Unknown' to the output list
        i += 1
    return output
Sample Output: hashes = [hash-1, hash-2, Unknown, hash-4, ...]
What I do next is use the ordered output to pair documents and hashes as tuples (documents[i], hashes[i]). Example: ('value1.xlsx', 'hash-1')
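In other words, the pairing step is essentially just:

pairs = list(zip(documents, hashes))
# e.g. [('value1.xlsx', 'hash-1'), ('value2.docx', 'hash-2'), ...]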
So I need the iterate_documents() values to stay in the order they were passed. Thus far, brute-force looping is the best I've got!
CodePudding user response:
You have a couple of options for improving Python performance without much code modification.
1.) You can use PyPy
This will be the easiest way. PyPy is an alternative Python interpreter with a just-in-time (JIT) compiler: it is generally faster than the standard CPython interpreter, but still slower than a fully compiled language such as C.
2.) You can try multi-threading or parallel processing
This will be more difficult; a minimal sketch follows below.
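As a rough sketch of the threading route (my own illustration, not the asker's code): spread the lookups over a thread pool and give each worker thread its own connection, since a sqlite3 connection should not be shared between threads. The database path, table name, and worker count here are assumptions based on the question:

import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

DB_PATH = 'test.db'          # assumed path to the hash database
_local = threading.local()   # holds one connection per worker thread

def _get_conn():
    # Lazily open a connection for the current thread
    if not hasattr(_local, 'conn'):
        _local.conn = sqlite3.connect(DB_PATH)
    return _local.conn

def lookup_one(name):
    # Look up a single file name; fall back to "Unknown" if it is not in the table
    row = _get_conn().execute(
        'SELECT md5hash FROM hash_table WHERE name=?', (name,)
    ).fetchone()
    return row[0] if row else "Unknown"

def iterate_documents_threaded(documents, workers=8):
    # executor.map returns results in the same order as `documents`
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(lookup_one, documents))

This keeps the output in the original order, which the question requires.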
With code modifications you have a couple of options as well.
1.) Use a for loop instead of a while loop
For loops are faster in Python, though not in C (see: Comparing for loops to while loops). A sketch of the rewrite follows after this list.
2.) Don't pass the variable documents to the function
The variable documents is a mutable value that changes and grows. This can create problems with functions because you end up recreating the variable every time.
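As a small illustration of point 1.), the question's function can be rewritten with a for loop without changing its behaviour (a sketch, not a benchmark):

def iterate_documents(cur, documents):
    output = []
    for name in documents:
        # Return the MD5 hash of the passed file name, or "Unknown" if it is absent
        match = cur.execute('SELECT md5hash FROM hash_table WHERE name=?', (name,)).fetchone()
        output.append(match[0] if match else "Unknown")
    return output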
CodePudding user response:
Since it is only 10K items, I might just fetch the ones you need from the database once and then do the matching locally:
import sqlite3

def chunks(l, n):
    # Yield successive n-sized chunks from list l
    for i in range(0, len(l), n):
        yield l[i:i + n]

conn = sqlite3.connect('test.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS hash_table(
        name TEXT PRIMARY KEY NOT NULL,
        md5hash TEXT
    );
''')
conn.execute("INSERT INTO hash_table(name, md5hash) VALUES ('value1.xlsx', 'some hash of value1.xlsx');")
conn.execute("INSERT INTO hash_table(name, md5hash) VALUES ('value2.docx', 'some hash of value2.docx');")

documents = ['value1.xlsx', 'value2.docx', 'value3.txt', 'value4.csv']

lookup = {}

## -----------------------
## load the lookup in chunks due to limits of SQLite
## -----------------------
chunk_size = 100
for chunk in chunks(documents, chunk_size):
    sql = f"SELECT name, md5hash FROM hash_table WHERE name IN ({','.join(['?'] * len(chunk))})"
    lookup = {**lookup, **dict(conn.execute(sql, chunk).fetchall())}
## -----------------------

doc_with_hash = {doc: lookup.get(doc, "Unknown") for doc in documents}
print(doc_with_hash)
That should give you:
{
'value1.xlsx': 'some hash of value1.xlsx',
'value2.docx': 'some hash of value2.docx',
'value3.txt': 'Unknown',
'value4.csv': 'Unknown'
}
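As a side note, the chunking also keeps each query under SQLite's limit on the number of ? host parameters per statement (999 by default in older versions, 32766 since SQLite 3.32.0), which a single IN (...) query over all 10K names could otherwise exceed.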