multithreading or multiprocessing for encrypting multiple files


I have created a function enc():

import base64

from cryptography.fernet import Fernet
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def enc():
    # derive a Fernet key from the password and salt
    password = bytes('asd123', 'utf-8')
    salt = bytes('asd123', 'utf-8')
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt,
        iterations=10000,
        backend=default_backend())
    key = base64.urlsafe_b64encode(kdf.derive(password))
    f = Fernet(key)

    # encrypt each file in place, overwriting the original
    for file in files:
        with open(file, 'rb') as original_file:
            original = original_file.read()

        encrypted = f.encrypt(original)

        with open(file, 'wb') as encrypted_file:
            encrypted_file.write(encrypted)

which loops through every file in files and encrypts it.

files = ['D:/folder/asd.txt',
         'D:/folder/qwe.mp4',
         'D:/folder/qwe.jpg']

I want to use multithreading or multiprocessing to make it faster. Is it possible? I need some help with the code.

I tried multithreading:

import threading

thread = threading.Thread(target=enc)
thread.start()
thread.join()

But it doesn't seem to improve the speed: a single thread still runs the whole loop sequentially, so nothing actually happens in parallel. I need some help implementing multiprocessing. Thanks.

CodePudding user response:

You need to rework your function: Python isn't smart enough to know which part of the code you want to run in multiple processes.

Most likely it's the for loop, since you want to encrypt the files in parallel.

Define a function that does the work of a single loop iteration, create the loop outside of it, and then use multiprocessing like this:

import base64
import multiprocessing

from cryptography.fernet import Fernet
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

# module-level setup so every worker process derives the same Fernet key
password = bytes('asd123', 'utf-8')
salt = bytes('asd123', 'utf-8')
kdf = PBKDF2HMAC(
    algorithm=hashes.SHA256(),
    length=32,
    salt=salt,
    iterations=10000,
    backend=default_backend())
key = base64.urlsafe_b64encode(kdf.derive(password))
f = Fernet(key)

files = ['D:/folder/asd.txt',
         'D:/folder/qwe.mp4',
         'D:/folder/qwe.jpg']

def enc(file):
    # encrypt a single file in place
    with open(file, 'rb') as original_file:
        original = original_file.read()

    encrypted = f.encrypt(original)

    with open(file, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)

if __name__ == '__main__':
    # start one process per file
    jobs = []
    for file in files:
        p = multiprocessing.Process(target=enc, args=(file,))
        jobs.append(p)
        p.start()

    # wait for all processes to finish
    for p in jobs:
        p.join()
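As a side note: if you have many more files than cores, starting one process per file is wasteful, and a pool caps the number of workers for you. A minimal sketch using the standard library's concurrent.futures, assuming the same module-level enc and files as above (the max_workers value is arbitrary):

from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    # at most 4 worker processes, reused across all files;
    # assumes enc(file) and files are defined at module level as above
    with ProcessPoolExecutor(max_workers=4) as executor:
        list(executor.map(enc, files))  # blocks until every file is done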

CodePudding user response:

Threading is not the best candidate for tasks that are CPU-intensive unless the task is being performed, for example, by a C-language library routine that releases the Global Interpreter Lock (GIL). In any event, you certainly will not get any performance gains with multithreading or multiprocessing unless you actually run multiple threads or processes in parallel; your attempt started a single thread that did all of the work itself.
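To see the GIL effect concretely, here is a minimal timing sketch, assuming a made-up pure-Python, CPU-bound function cpu_work: it runs the same workload first in a thread pool and then in a process pool. On a multi-core machine only the process pool should show a real speedup, because the busy loop never releases the GIL.

import time
from multiprocessing.pool import Pool, ThreadPool

def cpu_work(n):
    # stand-in for any pure-Python CPU-bound task; holds the GIL throughout
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    args = [5_000_000] * 4

    start = time.perf_counter()
    with ThreadPool(4) as thread_pool:
        thread_pool.map(cpu_work, args)   # threads take turns on the GIL
    print('threads:  ', time.perf_counter() - start)

    start = time.perf_counter()
    with Pool(4) as process_pool:
        process_pool.map(cpu_work, args)  # processes run truly in parallel
    print('processes:', time.perf_counter() - start)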

Let's say you have N tasks and M processors. If the tasks were pure CPU with no I/O (not exactly your situation), there would be no advantage in starting more than M processes to work on your N tasks, and for this a multiprocessing pool is ideal. When there is a mix of CPU and I/O, it could be advantageous to have a pool size greater than M, possibly even as large as N if there is a lot of I/O and very little CPU. But in that case it is better to use a combination of a multithreading pool and a multiprocessing pool (of size M), where the multithreading pool does all of the I/O work and the multiprocessing pool does the CPU computations. The following code shows that technique:

import base64

from functools import partial
from multiprocessing import cpu_count
from multiprocessing.pool import Pool, ThreadPool

from cryptography.fernet import Fernet
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def enc(f, process_pool, file):
    # the file I/O runs in the calling thread...
    with open(file, 'rb') as original_file:
        original = original_file.read()

    # ...while the CPU-bound encryption is handed off to the process pool
    encrypted = process_pool.apply(f.encrypt, args=(original,))

    with open(file, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)


def main():

    # initialize f:
    password = bytes('asd123', 'utf-8')
    salt = bytes('asd123', 'utf-8')
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt,
        iterations=10000,
        backend=default_backend())
    key = base64.urlsafe_b64encode(kdf.derive(password))
    f = Fernet(key)

    files = ['D:/folder/asd.txt',
             'D:/folder/qwe.mp4',
             'D:/folder/qwe.jpg']

    # the process pool size is the lesser of the number of files
    # to process and the number of cores we have:
    pool_size = min(cpu_count(), len(files))
    # process pool for CPU work, thread pool (one thread per file) for I/O:
    with Pool(pool_size) as process_pool, ThreadPool(len(files)) as thread_pool:
        worker = partial(enc, f, process_pool)
        thread_pool.map(worker, files)

if __name__ == '__main__':
    main()
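Note how the division of labor falls out of enc(): each thread blocks cheaply on its file reads and writes (during which CPython releases the GIL anyway), while process_pool.apply sends the f.encrypt call to one of the pool_size worker processes and waits for the result. So at most pool_size encryptions run in parallel at any moment, no matter how many files there are.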