How to use threading on dictionaries to improve time complexity?-CodePudding

I am new to threading. What I know is we can call threads on functions, but I want to call it on a dictionary.

I have a dictionary which have random numbers present in different indexes. I want to find the sum of all those numbers. What I want to do is basically to use a thread for every single row/index of that dictionary. That single thread will find sum of all the numbers in that specific row and then these sums of all the threads will be summed together to get me the final result.

import random
import time

li = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u"
, "v", "w", "x", "y"]

arr = {}

for k in range(0, 25):
    arr[li[k]] = [random.randrange(1, 10, 1) for i in range(1000000)]

start = time.perf_counter()

sum = 0
for k, v in arr.items():
    for value in v:
        sum  = value 

end = time.perf_counter()

print(sum)

print("Finished in: ", round(end-start, 2), " seconds")

I used to do it the simple way and it took me about 86 seconds in total (due to assigning of numbers to dictionary) and 5 seconds total to calculate the sum.

I want to improve those 5 seconds of the sum calculation by creating thread for every single index of the dictionary. Can anyone help me on this?

CodePudding user response：

So, here's an example of how you'd use multiprocessing for a "map-reduce" style summing problem.

This very much assumes each subproblem (as represented by process_key) is independent of the rest.

The final reduction (summing all key results together) is done by the main program.

import multiprocessing
import os
import string
import time
from typing import Tuple, List


def get_key_data(key: str) -> List[int]:
    # Get data for a given key from a database or wherever;
    # here we just get a big blob of random bytes.
    return list(os.urandom(1_000_000))


def process_key(key: str) -> Tuple[str, int]:
    # This function is run in a separate process,
    # so it can't access global data in the same way a function
    # in the same process could.  Program accordingly.
    key_data = get_key_data(key)
    result_for_key = sum(key_data)  # Could be heavier computation here...

    # Returning a tuple makes it easier to work with the keyed data in the main program.
    return (key, result_for_key)


def main():
    start = time.perf_counter()
    keys = list(string.ascii_lowercase)
    with multiprocessing.Pool() as p:
        results = {}
        # Since result order doesn't matter, we can use `imap_unordered` to optimize performance.
        # It would also be worth adding `chunksize=...` to spend less time in serializers.
        for key, result in p.imap_unordered(process_key, keys):  # unpacking result tuples here
            print(f"Got result {result} for key {key}")
            results[key] = result
    grand_total = sum(results.values())
    end = time.perf_counter()

    print(f"Grand total: {grand_total} in {end - start:.2f} seconds")


if __name__ == '__main__':
    main()

This prints out (something like)

Got result 127439637 for key y
Got result 127521766 for key z
Got result 127410016 for key a
Got result 127618358 for key b
Got result 127510624 for key c
Got result 127525228 for key d
Got result 127471359 for key e
Got result 127535553 for key f
Got result 127457231 for key m
Got result 127547738 for key n
Got result 127567059 for key o
Got result 127470823 for key g
Got result 127465435 for key h
Got result 127497010 for key i
Got result 127432593 for key j
Got result 127555330 for key k
Got result 127402226 for key l
Got result 127534939 for key p
Got result 127558057 for key q
Got result 127474231 for key r
Got result 127491137 for key v
Got result 127520358 for key w
Got result 127490582 for key x
Got result 127489005 for key s
Got result 127485159 for key t
Got result 127503702 for key u
Grand total: 3314975156 in 0.60 seconds

CodePudding user response：

I know...we can call threads on functions.

Nope. You can't call a thread on anything. When you write this:

thread = threading.Thread(foobar, args=(x, y, z))

You're not calling a thread. You are calling the constructor of the Thread class. The constructor makes a new Thread object, and then it's the Thread that does the calling: The Thread calls foobar(x, y, z).

What I want to do is basically to use a thread for every single row/index of that dictionary. That single thread will find sum of all the numbers in that specific row and...

Threads run code, and you have to provide the code that a thread will run in the form of a function. If you wanted a thread to "find the sum of all the numbers in a specific row..,"* then you'd have to write a function that finds the sum of all of the numbers, and then you'd have to create a new Thread that would call your function.

* Some of the other answers and comments on your question explain how Python's Global Interpreter Lock (a.k.a., the GIL) prevents you from using threads to make your program run any faster. So, the rest of this answer is fantasy because it won't make your program faster, but it does illustrate how to create threads.

Probably, you'll want to pass the dictionary and the row number to the function as arguments. Maybe you'll also want to pass it some mutable result structure (e.g., an array) into which the function can save the result.

def FindRowSum(dictionary, row, results):
    sum = 0
    for ...:
        sum = sum   ...
    results[row] = sum

...

allThreads = []
results = []
for row in range(...):
    thread = threading.Thread(FindRowSum, args=(myDictionary, row, results))
    allThreads.append(thread)

Then, further on down, if you want to wait for all of the threads to finish their work:

for thread in allThreads:
    thread.join()