I've been running into huge performance slowdowns when looping over strings in a large Python list and passing each one into a SQLite SELECT statement to return a value. Each lookup takes 0.5s to 0.7s, which works out to roughly 2 hours for 10K strings.
After extensive searching, I'm unsure how to "batch" the list into fewer consolidated queries in the hope of gaining speed.
I have this generated list of file names from a function:
documents = ['value1.xlsx', 'value2.docx', 'value3.txt', 'value4.csv', ...]
I also have a large 1GB database full of unique File Names and File Hashes.
def iterate_documents(cur, documents):
    i = 0
    output = []  # Declare empty list
    while i < len(documents):
        # Return the MD5 hash of the passed file name
        match = cur.execute('SELECT md5hash FROM hash_table WHERE name=?', (documents[i],)).fetchone()
        # If a file name hasn't been seen by the DB, say the hash is "Unknown"
        if match is None or len(match) == 0:
            match = "Unknown"
        else:
            match = match[0]
        output.append(match)  # Append the match or 'Unknown' to the output list
        i += 1
    return output
Sample Output: hashes = [hash-1, hash-2, Unknown, hash-4, ...]
What I do next is use the ordered output to pair documents and hashes as tuples (documents[i], hashes[i]). Example: ('value1.xlsx', 'hash-1')
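In other words, the pairing step is essentially just:

pairs = list(zip(documents, hashes))
# e.g. [('value1.xlsx', 'hash-1'), ('value2.docx', 'hash-2'), ...]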
So I need the iterate_documents() values to stay in the order they were passed. Thus far, brute-force looping is the best I've got!
CodePudding user response:
You have a couple of options for improving Python performance without much code modification.
1.) You can use PyPy
This will be the easiest way. PyPy is an alternative Python interpreter with a just-in-time (JIT) compiler: it is generally faster than the standard CPython interpreter, but still slower than a fully compiled language such as C.
2.) You can try multi-threading or parallel processing
This will be more difficult; a minimal sketch follows below.
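As a rough sketch of the threading route (my own illustration, not the asker's code): spread the lookups over a thread pool and give each worker thread its own connection, since a sqlite3 connection should not be shared between threads. The database path, table name, and worker count here are assumptions based on the question:

import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

DB_PATH = 'test.db'          # assumed path to the hash database
_local = threading.local()   # holds one connection per worker thread

def _get_conn():
    # Lazily open a connection for the current thread
    if not hasattr(_local, 'conn'):
        _local.conn = sqlite3.connect(DB_PATH)
    return _local.conn

def lookup_one(name):
    # Look up a single file name; fall back to "Unknown" if it is not in the table
    row = _get_conn().execute(
        'SELECT md5hash FROM hash_table WHERE name=?', (name,)
    ).fetchone()
    return row[0] if row else "Unknown"

def iterate_documents_threaded(documents, workers=8):
    # executor.map returns results in the same order as `documents`
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(lookup_one, documents))

This keeps the output in the original order, which the question requires.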
With code modifications you have a couple of options as well.
1.) Use a for loop instead of a while loop
For loops are faster in Python, though not in C (see: Comparing for loops to while loops). A sketch of the rewrite follows after this list.
2.) Don't pass the variable documents to the function
The variable documents is a mutable value that changes and grows. This can create problems with functions because you end up recreating the variable every time.
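As a small illustration of point 1.), the question's function can be rewritten with a for loop without changing its behaviour (a sketch, not a benchmark):

def iterate_documents(cur, documents):
    output = []
    for name in documents:
        # Return the MD5 hash of the passed file name, or "Unknown" if it is absent
        match = cur.execute('SELECT md5hash FROM hash_table WHERE name=?', (name,)).fetchone()
        output.append(match[0] if match else "Unknown")
    return output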
CodePudding user response:
Since it is only 10K items, I might just fetch the ones you need from the database once and then do the matching locally:
import sqlite3

def chunks(l, n):
    # Yield successive n-sized chunks from list l
    for i in range(0, len(l), n):
        yield l[i:i + n]

conn = sqlite3.connect('test.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS hash_table(
        name TEXT PRIMARY KEY NOT NULL,
        md5hash TEXT
    );
''')
conn.execute("INSERT INTO hash_table(name, md5hash) VALUES ('value1.xlsx', 'some hash of value1.xlsx');")
conn.execute("INSERT INTO hash_table(name, md5hash) VALUES ('value2.docx', 'some hash of value2.docx');")

documents = ['value1.xlsx', 'value2.docx', 'value3.txt', 'value4.csv']

lookup = {}

## -----------------------
## load the lookup in chunks due to limits of SQLite
## -----------------------
chunk_size = 100
for chunk in chunks(documents, chunk_size):
    sql = f"SELECT name, md5hash FROM hash_table WHERE name IN ({','.join(['?'] * len(chunk))})"
    lookup = {**lookup, **dict(conn.execute(sql, chunk).fetchall())}
## -----------------------

doc_with_hash = {doc: lookup.get(doc, "Unknown") for doc in documents}
print(doc_with_hash)
That should give you:
{
'value1.xlsx': 'some hash of value1.xlsx',
'value2.docx': 'some hash of value2.docx',
'value3.txt': 'Unknown',
'value4.csv': 'Unknown'
}
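As a side note, the chunking also keeps each query under SQLite's limit on the number of ? host parameters per statement (999 by default in older versions, 32766 since SQLite 3.32.0), which a single IN (...) query over all 10K names could otherwise exceed.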