Home > Software design >  Fastest way to intersect and merge many dictionaries
Fastest way to intersect and merge many dictionaries

Time:02-18

I have 1000 dictionaries (grid_1, grid_2 ... grid_1000) stored as pickle objects (generated by some previous process) and one reference dictionary. I need to compare each of the pickled dictionaries to the reference, and then combine the results.

The input dictionaries might look like:

grid_1 = {'143741': {'467457':1,'501089':2,'903718':1,'999216':5,'1040952':2},'281092':{'1434': 67,'3345': 345}, '33123': {'4566':5,'56788':45}}

grid_2 = {'143741': {'467457':5,'501089':7,'1040952':9},'281092':{'1434': 67,'3345': 20}, '33123': {'4566':7,'56788':38}}

and the reference dictionary might look like:

grid_density_original = {'143741': {'467457':1,'501089':2,'903718':1,'999216':5,'9990':4},'281092':{'1434': 60,'3345': 3,'9991': 43}, '33123': {'56788':4}}

In the first step, we should intersect the individual grid_n dicts like:

# intersection of grid_1 and grid_density_original
assert intersect_1 == {'143741': {'467457':1,'501089':2,'903718':1,'999216':5},'281092':{'1434': 67,'3345': 345}, '33123': {'56788':45}
# intersection of grid_2 and grid_density_original
assert intersect_2 == {'143741': {'467457':5,'501089':7},'281092':{'1434': 67,'3345': 20}, '33123': {'56788':38}}

Then these results should be combined, as follows:

assert combine12 == {'143741': {'467457':[1,5],'501089':[2,7],'903718':[1,99999],'999216':[5,99999]},'281092':{'1434': [67,67],'3345': [345,20]}, '33123': {'56788':[45,38]}

This appears to b the slow part, as the inner list size increases each time a new intersect_n is added.

This is the code I have currently. My actual dictionaries have on the order of 10,000 keys, and this code takes about 4 days to run.

from collections import defaultdict
from collections import Counter
import pickle
import gc
import copy
import pickle
import scipy.stats as st
from collections import defaultdict


# grid_density_orignal is original nested dictionary we compare each of 1000 grids to: 
with open('path\grid_density_original_intercountry.pickle','rb') as handle:
    grid_density_orignal = pickle.load(handle,encoding ='latin-1')  

    
# Previous process generated 1000 grids and dump them as .pickle files : grid_1,grid_2....grid_1000 

for iteration in range(1,1001):
    # load each grid i.e.grid_1,grid_2...grid_1000 into memory sequentially 
    filename = 'path\grid_%s' %iteration
    with open(filename,'rb') as handle:
        globals()['dictt_%s' % iteration] = pickle.load(handle,encoding ='latin-1') 
        
    # Counter to store grid-grids densities: same dictionary structure as grid_density_orignal
    globals()['g_den_%s' % iteration] = defaultdict(list)
    for k,v in globals()['dictt_%s' % iteration].items():
        globals()['g_den_%s' % iteration][k] = Counter(v)
     
    # here we find the common grid-grid connections between grid_density_orignal and each of the 1000 grids
    globals()['intersect_%s' % iteration] = defaultdict(list)
    for k,v in grid_density_orignal.items():
        pergrid = defaultdict(list)
        common_grid_ids = v.keys() & globals()['g_den_%s' % iteration][k].keys()
        for gridid in common_grid_ids:
            pergrid[gridid] = globals()['g_den_%s' % iteration][k][gridid]
        globals()['intersect_%s' % iteration][k] = pergrid

    
print('All 1000 grids intersection done')


# From previous code we now have 1000 intersection grids : intersect_1,intersect_2 ...... intersect_1000
for iteration in range(1,1000):

    itnext = iteration  1         # to get next intersect out of 1000
    globals()['combine_%s%s' %(iteration,itnext)] = defaultdict(list)  # dictionary to store intermediate combine step results between 2 intersects : intersect_x and intersect_x 1
    
    
  
   
    for k,v in globals()['intersect_%s' %iteration].items():
        innr = []
        combine = defaultdict(list)
        for key in set(list(globals()['intersect_%s' % iteration][k].keys())  list(globals()['intersect_%s' % itnext][k].keys())):  # key in the union of intersect_1 , intersect_2
            
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key,99999), int) and isinstance(globals()['intersect_%s' % itnext][k].get(key,99999), int)): # get key value if exists, if for e.g. a certain grid doesnt exist in intersect_1, intersect_2 we give it default of 99999 as placeholder, alos check if value is an instance of int or list as in intial step it is an int but later we get lists after combining every 2 intersects 
                
                
                combine[key] = [globals()['intersect_%s' % iteration][k].get(key,99999)]   [globals()['intersect_%s' % itnext][k].get(key,99999)]   # combine into list intersect_1, intersect_2
                
            
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key,99999), list) and isinstance(globals()['intersect_%s' % itnext][k].get(key,99999), int)): # this condition will be reached after initial step of intersect_1   intersect_2
                
                combine[key] = globals()['intersect_%s' % iteration][k].get(key,99999)   [globals()['intersect_%s' % itnext][k].get(key,99999)]    # combine into list intersect_1, intersect_2

                
        globals()['combine_%s%s' %(iteration,itnext)][k] = combine
        
    globals()['intersect_%s' % itnext] = copy.copy(globals()['combine_%s%s' %(iteration,itnext)])   # copy combine dict onto intersect dict so we can continue combining this dict with the next iteration

    print('2 combined: ',iteration,itnext)
    del globals()['intersect_%s' % iteration]                      # delete older intersect, combine as we dont need them and may cause memory overflow when more dicts are in memory
    del globals()['combine_%s%s' %(iteration,itnext)]
    gc.collect()                                                   # explicitly call the garbage collector as too big for ram

            
   
print('intersection and combine done')  # at the end we have intersect_1000 with is a dict with all grid_id ids as keys and a list of densities (list size is 1000 corresponding to 1000 grids)

How can I improve the performance of the code?

CodePudding user response:

I will focus on the second loop (merging the intersect_n dicts), give some general advice, and leave the rest as an exercise. My final result is so much shorter than the original that I feel the need to break down the process into several steps. Hopefully you will learn many useful techniques along the way.

After removing blank lines, comments (we'll rewrite those later anyway) and debug traces, and cleaning up a bit of whitespace we are starting with:

for iteration in range(1, 1000):
    itnext = iteration   1
    globals()['combine_%s%s' % (iteration, itnext)] = defaultdict(list)
    for k, v in globals()['intersect_%s' % iteration].items():
        innr = []
        combine = defaultdict(list)
        for key in set(list(globals()['intersect_%s' % iteration][k].keys())   list(globals()['intersect_%s' % itnext][k].keys())):
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), int) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):
                combine[key] = [globals()['intersect_%s' % iteration][k].get(key, 99999)]   [globals()['intersect_%s' % itnext][k].get(key, 99999)]
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), list) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):
                combine[key] = globals()['intersect_%s' % iteration][k].get(key, 99999)   [globals()['intersect_%s' % itnext][k].get(key, 99999)]
        globals()['combine_%s%s' % (iteration, itnext)][k] = combine
    globals()['intersect_%s' % itnext] = copy.copy(globals()['combine_%s%s' % (iteration, itnext)])
    del globals()['intersect_%s' % iteration]
    del globals()['combine_%s%s' % (iteration, itnext)]
    gc.collect()

Now we can get to work.

1. Properly structuring the data

Trying to create variable variables is generally a bad idea. It also has a performance cost:

$ python -m timeit -s "global x_0; x_0 = 'test'" "globals()['x_%s' % 0]"
2000000 loops, best of 5: 193 nsec per loop
$ python -m timeit -s "global x; x = ['test']" "x[0]"
10000000 loops, best of 5: 29.1 nsec per loop

Yes, we're talking about nanoseconds, but the existing code is doing it constantly, for nearly every access. But more importantly, visually simplified code is easier to analyze for subsequent improvements.

Clearly we already know how to manipulate nested data structures; adding one more level of nesting isn't an issue. To store the intersect_n results, rather than having 1000 dynamically named variables, the obvious solution is to just make a 1000-element list, where each element is one of those results. (Note that we will start counting them from 0 rather than 1, of course.) As for globals()['combine_%s%s' % (iteration, itnext)] - that makes no sense; we don't need to create a new variable name each time through, because we're going to throw that data away at the end of the loop anyway. So let's just use a constant name.

Once the first loop is modified to give the right data (which will also look much simpler in that part), the access looks much simpler here:

for iteration in range(999):
    itnext = iteration   1
    combine_overall = defaultdict(list)
    for k, v in intersect[iteration].items():
        combine = defaultdict(list)
        for key in set(list(intersect[iteration][k].keys())   list(intersection[itnext][k].keys())):
            if (isinstance(intersect[iteration][k].get(key, 99999), int) and isinstance(intersect[itnext][k].get(key, 99999), int)):
                combine[key] = [intersect[iteration][k].get(key, 99999)]   [intersect[itnext][k].get(key, 99999)]
            if (isinstance(intersect[iteration][k].get(key, 99999), list) and isinstance(intersect[itnext][k].get(key, 99999), int)):
                combine[key] = intersect[iteration][k].get(key, 99999)   [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine
    intersect[itnext] = copy.copy(combine_overall)

You'll notice I removed the memory management stuff at the end. I'll discuss a better approach for that later. The del for the iteration value would mess up iterating over that list, and we don't need to delete combine_overall because we'll just replace it with a new empty defaultdict. I also sneakily removed innr = [], because the value is simply unused. Like I said: visually simpler code is easier to analyze.

2. Unnecessary type checks

All this isinstance stuff is hard to read, and time consuming especially considering all the repeated access:

$ python -m timeit -s "global x; x = {'a': {'b': {'c': 0}}}" "isinstance(x['a']['b'].get('c', 0), int)"
2000000 loops, best of 5: 138 nsec per loop
$ python -m timeit -s "global x; x = {'a': {'b': {'c': 0}}}" "x['a']['b'].get('c', 0)"
5000000 loops, best of 5: 83.9 nsec per loop

We know the exact conditions under which intersect[itnext][k].get(key, 99999) should be an int: always, or else the data is simply corrupted. (We can worry about that later, and probably by doing exception handling in the calling code.) We know the conditions under which intersect[iteration][k].get(key, 99999) should be an int or a list: it will be an int (or missing) the first time through, and a list (or missing) every other time. Fixing this will also make it easier to understand the next step.

for iteration in range(999):
    itnext = iteration   1
    combine_overall = defaultdict(list)
    for k, v in intersect[iteration].items():
        combine = defaultdict(list)
        for key in set(list(intersect[iteration][k].keys())   list(intersection[itnext][k].keys())):
            if iteration == 0:
                combine[key] = [intersect[iteration][k].get(key, 99999)]   [intersect[itnext][k].get(key, 99999)]
            else:
                combine[key] = intersect[iteration][k].get(key, [99999])   [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine
    intersect[itnext] = copy.copy(combine_overall)

Notice how, when the key is either a list or missing, we use a list as the default value. That's the trick to preserving type consistency and making it possible to write the code this way.

3. An unnecessary copy and unnecessary pair-wise iteration

Since combine_overall isn't referenced anywhere else, we don't actually need to copy it over the intersect[itnext] value - we could just reassign it without any aliasing issues. But better yet is to just leave it where it is. Instead of considering adjacent pairs of iteration values that we merge together pairwise, we just merge everything into combine_overall, one at a time (and set up an initial defaultdict once instead of overwriting it).

This does mean we'll have to do some setup work - instead of special-casing the first merge, we'll "merge" intersect[0] by itself into the initial state of combine_overall.

combine_overall = defaultdict(list)
for k, v in intersect[0].items():
    combine = defaultdict(list)
    for key, value in v.keys():
        combine[key] = [value]
    combine_overall[k] = combine

for iteration in range(999):
    itnext = iteration   1
    for k, v in combine_overall.items():
        combine = defaultdict(list)
        for key in set(list(combine_overall[k].keys())   list(intersection[itnext][k].keys())):
            combine[key] = combine_overall[k].get(key, [99999])   [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine

Notice how much more simply we can do the initial step - we know which keys we're working with, so no .gets are necessary; and there's only one dict, so no merging of key-sets is necessary. But we aren't done...

4. Some miscellaneous cleanup

Looking at this version, we can more easily notice:

  • The iteration loop doesn't use iteration at all, but only itnext - so we can fix that. Also, there's no reason to use indexes like this for a simple loop - we should directly iterate over elements.
  • combine_overall will hold dicts, not lists (as we assign the values from combine, which is a defaultdict); so defaultdict(list) makes no sense.
  • Instead of using a temporary combine to build a replacement for combine_overall[k] and then assigning it back, we could just directly modify combine_overall[k]. In this way, we would actually get benefit from the defaultdict behaviour. We actually want the default values to be defaultdicts themselves - not completely straightforward, but very doable.
  • Since we no longer need to make a distinction between the overall combined result and individual results, we can rename combine_overall to just combine to look a little cleaner.
combine = defaultdict(lambda: defaultdict(list))
for k, v in intersect[0].items():
    for key, value in v.keys():
        combine[k][key] = [value]

for to_merge in intersect[1:]:
    for k, v in combine.items():
        for key in set(list(combine[k].keys())   list(to_merge[k].keys())):
            combine[k][key] = combine[k].get(key, [99999])   [to_merge[k].get(key, 99999)]

5. Oops, there was a bug all along. Also, "special cases aren't special enough to break the rules"

Hopefully, this looks a little strange to you. Why are we using .get on a defaultdict? Why would we have this single-item placeholder value, rather than an empty list? Why do we have to do this complicated check for the possible keys to use? And do we really need to handle the first intersect value differently?

Consider what happens on the following data (using the original naming conventions):

intersect_1 = {'1': {'1': 1}}
intersect_2 = {'1': {'1': 1}}
intersect_3 = {'1': {'1': 1, '2': 1}}

With the original approach, I get a result like:

$ python -i tmp.py 
>>> intersect_3
defaultdict(<class 'list'>, {'1': defaultdict(<class 'list'>, {'2': [99999, 1], '1': [1, 1, 1]})})

Oops. intersect_3['1']['2'] only has two elements ([99999, 1]), and thus doesn't match up with intersect_3['1']['1']. That defeats the purpose of the 99999 placeholder values. The problem is that, because the value was missing multiple times at the start, multiple 99999 values should have been inserted, but only one was - the one that came from creating the initial list, when the isinstance check reported an int rather than a list, when it retrieved the 99999 with .get. That lost information: we couldn't distinguish between a missing int and a missing list.

How do we work around this? Simple: we use the same, overall key set each time - the total set of keys that should be present, which we get from grid_density_original[k]. Whenever one of those entries is missing in any of the intersect results, we write the placeholder value instead. Now we are also handling each intersect result the same way - instead of doing special setup with the first value, and merging everything else in, we are merging everything in to an empty initial state.

Instead of iterating over the .items of combine (and expecting to_merge to have the same keys), we iterate over to_merge, which makes a lot more sense. Instead of creating and assigning a list for combine[k][key], we simply append a value to the existing list (and we know there is one available, because we are using defaultdicts properly now).

Thus:

combine = defaultdict(lambda: defaultdict(list))
for to_merge in intersect:
    for k, v in to_merge.items():
        # BTW, you typo'd this as "orignal" in the original code
        for key in grid_density_original[k].keys():
            combine[k][key].append(v.get(key, 99999))

(This does mean that, if none of the intersect dicts contain a particular key, the result will contain a list of 1000 99999 values, instead of omitting the key completely. I hope that isn't an issue.)

6. Okay, but weren't you going to do something about the memory usage?

Oh, right. Take a moment to write the corresponding code for the first loop, please.

Got it? Okay, now. Set up combine first; and then each time you compute one of the would-be elements of intersect, merge it in (using the two inner loops shown here) instead of building the actual intersect list.

Oh, and I think you'll find that - since we're going to iterate over grid_density_original[k].keys() anyway - the preprocessing to remove other keys from the g_den results isn't actually needed at all now.

  • Related