Parallelizing code across CPU cores when iterating over a nested dictionary with a total of 700K entries

I have the following code:

for key in test_large_images.keys():
    test_large_images[key]['avg_prob'] = 0
    sum = 0
    for value in test_large_images[key]['pred_probability']:
        print(test_large_images[key]['pred'])
        print(type(test_large_images[key]['pred']))
        if test_large_images[key]['pred'] == 1:
            sum += value
    test_large_images[key]['avg_prob'] = sum/len(test_large_images[key]['pred_probability'])

This is a dictionary of 359 large images, each of which contains between 200 and 8,000 smaller images that I call patches. test_large_images holds the inference results for those patches: each entry carries a prediction probability, the large image name, the patch name, and so on. My goal is to compute the average probability of each large image from the prediction probabilities of the patches inside it. When I run this loop on a smaller dataset (45K patches), whose inferences I saved to a pkl file, it finishes very quickly. But on the full dataset, the script has now been running for more than 130 minutes in a Jupyter notebook over VSCode Remote (with a local client on a Mac).
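For context, each entry has roughly this shape. The key name and the values are illustrative, only the fields used in the loop are shown, and the other per-patch fields (patch name, large image name, etc.) are omitted:

test_large_images = {
    'large_image_001': {                          # illustrative key name
        'pred': ...,                              # prediction label(s) tested against 1 in the loop
        'pred_probability': [0.91, 0.87, 0.63],   # one probability per patch (200 to 8,000 values)
        # ... plus patch name, large image name, etc.
    },
    # ... 359 large images, ~700K patches in total
}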

Is there a way I can take advantage of the 24 CPU cores I have to accelerate this nested dictionary computation?

CodePudding user response:

  1. Don't use sum as a variable name, since it's a built in function.
  2. The line test_large_images[key]['avg_prob'] = 0 is not needed.
  3. PeterK is correct that your condition doesn't need to be calculated every time in the inner for loop.
  4. Why are we printing these repeatedly, or was that just for testing?

for key in test_large_images.keys():
    add = 0
    condition = test_large_images[key]['pred'] == 1 # This is what PeterK means by take it out (of the loop).
    for value in test_large_images[key]['pred_probability']:
        # print(test_large_images[key]['pred'])
        # print(type(test_large_images[key]['pred']))
        if condition:
            add += value
    test_large_images[key]['avg_prob'] = add/len(test_large_images[key]['pred_probability'])

Your code could be simplified to:

for key in test_large_images.keys():
    condition = test_large_images[key]['pred'] == 1
    num = sum(x for x in test_large_images[key]['pred_probability'] if condition)
    denom = len(test_large_images[key]['pred_probability'])
    test_large_images[key]['avg_prob'] = num/denom

Based on feedback and some additional optimization:

for key in test_large_images.keys():
    if test_large_images[key]['pred'] != 1:
        test_large_images[key]['avg_prob'] = 0
        continue
    values = test_large_images[key]['pred_probability']
    test_large_images[key]['avg_prob'] = sum(values)/len(values)

Here are two different types of averaging (I am most interested in averaging the probabilities only over the entries for which the prediction was 1; I call that avg_prob_pos):

for key in progress_bar(test_large_images.keys()):        # progress_bar: e.g. fastprogress or a tqdm-style wrapper
    preds = test_large_images[key]['pred']                 # per-patch predictions (0 or 1)
    probs = test_large_images[key]['pred_probability']     # per-patch probabilities
    # Sum of probabilities over the patches predicted as 1
    num = sum(p for p, pred in zip(probs, preds) if pred == 1)
    denom = len(probs)
    count = preds.count(1)
    if count != 0:
        test_large_images[key]['avg_prob_pos'] = num/count
    test_large_images[key]['avg_prob'] = num/denom

    percentage = count/len(preds)
    test_large_images[key]['percentage'] = percentage
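
As for the original question about the 24 CPU cores: each key is processed independently, so the per-key work can be spread over a multiprocessing pool. The sketch below is only illustrative; compute_averages is a hypothetical helper wrapping the loop body above, and it assumes the per-key sub-dictionaries are picklable and that 'pred' is a list of per-patch 0/1 predictions:

from multiprocessing import Pool

def compute_averages(item):
    # Hypothetical helper: computes the averages for one large image (one dict entry).
    key, entry = item
    preds = entry['pred']                      # per-patch predictions (0 or 1)
    probs = entry['pred_probability']          # per-patch probabilities
    num = sum(p for p, pred in zip(probs, preds) if pred == 1)
    count = preds.count(1)
    entry['avg_prob'] = num / len(probs)
    if count != 0:
        entry['avg_prob_pos'] = num / count
    entry['percentage'] = count / len(preds)
    return key, entry

if __name__ == '__main__':
    with Pool(processes=24) as pool:
        results = pool.map(compute_averages, test_large_images.items(), chunksize=16)
    test_large_images = dict(results)

That said, 700K iterations of this arithmetic normally finish in well under a minute in plain Python; the repeated print calls are the most likely reason the original loop appears to hang, so removing them may already make parallelization unnecessary.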