I have 1000 dictionaries (`grid_1`, `grid_2`, ... `grid_1000`) stored as pickle objects (generated by some previous process) and one reference dictionary. I need to compare each of the pickled dictionaries to the reference, and then combine the results.
The input dictionaries might look like:
grid_1 = {'143741': {'467457':1,'501089':2,'903718':1,'999216':5,'1040952':2},'281092':{'1434': 67,'3345': 345}, '33123': {'4566':5,'56788':45}}
grid_2 = {'143741': {'467457':5,'501089':7,'1040952':9},'281092':{'1434': 67,'3345': 20}, '33123': {'4566':7,'56788':38}}
and the reference dictionary might look like:
grid_density_original = {'143741': {'467457':1,'501089':2,'903718':1,'999216':5,'9990':4},'281092':{'1434': 60,'3345': 3,'9991': 43}, '33123': {'56788':4}}
In the first step, we should intersect the individual `grid_n` dicts like:
# intersection of grid_1 and grid_density_original
assert intersect_1 == {'143741': {'467457':1,'501089':2,'903718':1,'999216':5},'281092':{'1434': 67,'3345': 345}, '33123': {'56788':45}}
# intersection of grid_2 and grid_density_original
assert intersect_2 == {'143741': {'467457':5,'501089':7},'281092':{'1434': 67,'3345': 20}, '33123': {'56788':38}}
Then these results should be combined, as follows:
assert combine12 == {'143741': {'467457':[1,5],'501089':[2,7],'903718':[1,99999],'999216':[5,99999]},'281092':{'1434': [67,67],'3345': [345,20]}, '33123': {'56788':[45,38]}}
This appears to be the slow part: the inner list size increases each time a new `intersect_n` is added, and each pairwise combine copies every one of those lists in full.
This is the code I have currently. My actual dictionaries have on the order of 10,000 keys, and this code takes about 4 days to run.
from collections import defaultdict
from collections import Counter
import pickle
import gc
import copy
import scipy.stats as st

# grid_density_orignal is the original nested dictionary we compare each of the 1000 grids to:
with open('path\grid_density_original_intercountry.pickle', 'rb') as handle:
    grid_density_orignal = pickle.load(handle, encoding='latin-1')

# A previous process generated 1000 grids and dumped them as .pickle files: grid_1, grid_2, ..., grid_1000
for iteration in range(1, 1001):
    # load each grid (grid_1, grid_2, ..., grid_1000) into memory sequentially
    filename = 'path\grid_%s' % iteration
    with open(filename, 'rb') as handle:
        globals()['dictt_%s' % iteration] = pickle.load(handle, encoding='latin-1')
    # Counter to store grid-grid densities: same dictionary structure as grid_density_orignal
    globals()['g_den_%s' % iteration] = defaultdict(list)
    for k, v in globals()['dictt_%s' % iteration].items():
        globals()['g_den_%s' % iteration][k] = Counter(v)
    # here we find the common grid-grid connections between grid_density_orignal and each of the 1000 grids
    globals()['intersect_%s' % iteration] = defaultdict(list)
    for k, v in grid_density_orignal.items():
        pergrid = defaultdict(list)
        common_grid_ids = v.keys() & globals()['g_den_%s' % iteration][k].keys()
        for gridid in common_grid_ids:
            pergrid[gridid] = globals()['g_den_%s' % iteration][k][gridid]
        globals()['intersect_%s' % iteration][k] = pergrid
print('All 1000 grids intersection done')

# From the previous code we now have 1000 intersection grids: intersect_1, intersect_2, ..., intersect_1000
for iteration in range(1, 1000):
    itnext = iteration + 1  # to get the next intersect out of the 1000
    # dictionary to store intermediate combine-step results between two intersects: intersect_x and intersect_x+1
    globals()['combine_%s%s' % (iteration, itnext)] = defaultdict(list)
    for k, v in globals()['intersect_%s' % iteration].items():
        innr = []
        combine = defaultdict(list)
        # key in the union of the intersect_1 and intersect_2 key sets
        for key in set(list(globals()['intersect_%s' % iteration][k].keys()) + list(globals()['intersect_%s' % itnext][k].keys())):
            # get the key's value if it exists; if e.g. a certain grid doesn't exist in intersect_1 or
            # intersect_2, we give it a default of 99999 as a placeholder. Also check whether the value
            # is an int or a list: in the initial step it is an int, but after combining every 2
            # intersects we get lists.
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), int) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):
                # combine intersect_1 and intersect_2 into a list
                combine[key] = [globals()['intersect_%s' % iteration][k].get(key, 99999)] + [globals()['intersect_%s' % itnext][k].get(key, 99999)]
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), list) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):
                # this condition is reached after the initial step of combining intersect_1 and intersect_2
                combine[key] = globals()['intersect_%s' % iteration][k].get(key, 99999) + [globals()['intersect_%s' % itnext][k].get(key, 99999)]
        globals()['combine_%s%s' % (iteration, itnext)][k] = combine
    # copy the combine dict onto the intersect dict so we can keep combining it with the next iteration
    globals()['intersect_%s' % itnext] = copy.copy(globals()['combine_%s%s' % (iteration, itnext)])
    print('2 combined: ', iteration, itnext)
    # delete the older intersect and combine dicts; we don't need them, and keeping more dicts
    # in memory may cause a memory overflow
    del globals()['intersect_%s' % iteration]
    del globals()['combine_%s%s' % (iteration, itnext)]
    gc.collect()  # explicitly call the garbage collector, as the data is too big for RAM
print('intersection and combine done')
# at the end we have intersect_1000, which is a dict with all grid ids as keys and a list of
# densities per key (list size 1000, corresponding to the 1000 grids)
How can I improve the performance of the code?
CodePudding user response:
I will focus on the second loop (merging the `intersect_n` dicts), give some general advice, and leave the rest as an exercise. My final result is so much shorter than the original that I feel the need to break down the process into several steps. Hopefully you will learn many useful techniques along the way.
After removing blank lines, comments (we'll rewrite those later anyway) and debug traces, and cleaning up a bit of whitespace, we are starting with:
for iteration in range(1, 1000):
    itnext = iteration + 1
    globals()['combine_%s%s' % (iteration, itnext)] = defaultdict(list)
    for k, v in globals()['intersect_%s' % iteration].items():
        innr = []
        combine = defaultdict(list)
        for key in set(list(globals()['intersect_%s' % iteration][k].keys()) + list(globals()['intersect_%s' % itnext][k].keys())):
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), int) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):
                combine[key] = [globals()['intersect_%s' % iteration][k].get(key, 99999)] + [globals()['intersect_%s' % itnext][k].get(key, 99999)]
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), list) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):
                combine[key] = globals()['intersect_%s' % iteration][k].get(key, 99999) + [globals()['intersect_%s' % itnext][k].get(key, 99999)]
        globals()['combine_%s%s' % (iteration, itnext)][k] = combine
    globals()['intersect_%s' % itnext] = copy.copy(globals()['combine_%s%s' % (iteration, itnext)])
    del globals()['intersect_%s' % iteration]
    del globals()['combine_%s%s' % (iteration, itnext)]
    gc.collect()
Now we can get to work.
1. Properly structuring the data
Trying to create variable variables is generally a bad idea. It also has a performance cost:
$ python -m timeit -s "global x_0; x_0 = 'test'" "globals()['x_%s' % 0]"
2000000 loops, best of 5: 193 nsec per loop
$ python -m timeit -s "global x; x = ['test']" "x[0]"
10000000 loops, best of 5: 29.1 nsec per loop
Yes, we're talking about nanoseconds, but the existing code is doing it constantly, for nearly every access. But more importantly, visually simplified code is easier to analyze for subsequent improvements.
Clearly we already know how to manipulate nested data structures; adding one more level of nesting isn't an issue. To store the `intersect_n` results, rather than having 1000 dynamically named variables, the obvious solution is to just make a 1000-element list, where each element is one of those results. (Note that we will start counting them from 0 rather than 1, of course.) As for `globals()['combine_%s%s' % (iteration, itnext)]` - that makes no sense; we don't need to create a new variable name each time through, because we're going to throw that data away at the end of the loop anyway. So let's just use a constant name.
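For example, the loading part of the first loop might look like this (a sketch; `make_intersect` is hypothetical shorthand for your existing per-grid intersection logic, and the file layout is taken from the question):

# Build one list holding all 1000 results, instead of 1000 dynamically named globals.
intersect = []
for iteration in range(1, 1001):
    with open('path\grid_%s' % iteration, 'rb') as handle:
        grid = pickle.load(handle, encoding='latin-1')
    # make_intersect stands in for the (unchanged) logic that intersects one grid
    # with the reference dict; intersect[0] then corresponds to grid_1, and so on.
    intersect.append(make_intersect(grid))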
Once the first loop is modified to give the right data (which will also look much simpler in that part), the access looks much simpler here:
for iteration in range(999):
    itnext = iteration + 1
    combine_overall = defaultdict(list)
    for k, v in intersect[iteration].items():
        combine = defaultdict(list)
        for key in set(list(intersect[iteration][k].keys()) + list(intersect[itnext][k].keys())):
            if (isinstance(intersect[iteration][k].get(key, 99999), int) and isinstance(intersect[itnext][k].get(key, 99999), int)):
                combine[key] = [intersect[iteration][k].get(key, 99999)] + [intersect[itnext][k].get(key, 99999)]
            if (isinstance(intersect[iteration][k].get(key, 99999), list) and isinstance(intersect[itnext][k].get(key, 99999), int)):
                combine[key] = intersect[iteration][k].get(key, 99999) + [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine
    intersect[itnext] = copy.copy(combine_overall)
You'll notice I removed the memory management stuff at the end. I'll discuss a better approach for that later. The `del` for the `iteration` value would mess up iterating over that list, and we don't need to delete `combine_overall` because we'll just replace it with a new, empty defaultdict. I also sneakily removed `innr = []`, because the value is simply unused. Like I said: visually simpler code is easier to analyze.
2. Unnecessary type checks
All this `isinstance` stuff is hard to read, and time-consuming, especially considering all the repeated access:
$ python -m timeit -s "global x; x = {'a': {'b': {'c': 0}}}" "isinstance(x['a']['b'].get('c', 0), int)"
2000000 loops, best of 5: 138 nsec per loop
$ python -m timeit -s "global x; x = {'a': {'b': {'c': 0}}}" "x['a']['b'].get('c', 0)"
5000000 loops, best of 5: 83.9 nsec per loop
We know the exact conditions under which `intersect[itnext][k].get(key, 99999)` should be an `int`: always, or else the data is simply corrupted. (We can worry about that later, probably by doing exception handling in the calling code.) We know the conditions under which `intersect[iteration][k].get(key, 99999)` should be an `int` or a `list`: it will be an `int` (or missing) the first time through, and a `list` (or missing) every other time. Fixing this will also make it easier to understand the next step.
for iteration in range(999):
    itnext = iteration + 1
    combine_overall = defaultdict(list)
    for k, v in intersect[iteration].items():
        combine = defaultdict(list)
        for key in set(list(intersect[iteration][k].keys()) + list(intersect[itnext][k].keys())):
            if iteration == 0:
                combine[key] = [intersect[iteration][k].get(key, 99999)] + [intersect[itnext][k].get(key, 99999)]
            else:
                combine[key] = intersect[iteration][k].get(key, [99999]) + [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine
    intersect[itnext] = copy.copy(combine_overall)
Notice how, when the value is either a list or missing, we use a list as the default. That's the trick to preserving type consistency and making it possible to write the code this way.
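To see the trick in isolation (a toy snippet, not part of the solution):

# The left operand of + is always a list, whether or not the key is present.
d = {'a': [1, 2]}
d['a'] = d.get('a', [99999]) + [3]  # key present: [1, 2, 3]
d['b'] = d.get('b', [99999]) + [3]  # key missing: [99999, 3]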
3. An unnecessary copy and unnecessary pair-wise iteration
Since `combine_overall` isn't referenced anywhere else, we don't actually need to copy it over the `intersect[itnext]` value - we could just reassign it without any aliasing issues. But better yet is to just leave it where it is. Instead of considering adjacent pairs of `iteration` values that we merge together pairwise, we just merge everything into `combine_overall`, one at a time (and set up an initial `defaultdict` once instead of overwriting it).

This does mean we'll have to do some setup work - instead of special-casing the first merge, we'll "merge" `intersect[0]` by itself into the initial state of `combine_overall`.
combine_overall = defaultdict(list)
for k, v in intersect[0].items():
    combine = defaultdict(list)
    for key, value in v.items():
        combine[key] = [value]
    combine_overall[k] = combine

for iteration in range(999):
    itnext = iteration + 1
    for k, v in combine_overall.items():
        combine = defaultdict(list)
        for key in set(list(combine_overall[k].keys()) + list(intersect[itnext][k].keys())):
            combine[key] = combine_overall[k].get(key, [99999]) + [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine
Notice how much more simply we can do the initial step - we know which keys we're working with, so no `.get`s are necessary; and there's only one dict, so no merging of key-sets is necessary. But we aren't done...
4. Some miscellaneous cleanup
Looking at this version, we can more easily notice:
- The `iteration` loop doesn't use `iteration` at all, but only `itnext` - so we can fix that. Also, there's no reason to use indexes like this for a simple loop - we should directly iterate over elements.
- `combine_overall` will hold dicts, not lists (as we assign the values from `combine`, which is a `defaultdict`); so `defaultdict(list)` makes no sense.
- Instead of using a temporary `combine` to build a replacement for `combine_overall[k]` and then assigning it back, we could just directly modify `combine_overall[k]`. In this way, we would actually get benefit from the `defaultdict` behaviour. We actually want the default values to be defaultdicts themselves - not completely straightforward, but very doable.
- Since we no longer need to make a distinction between the overall combined result and individual results, we can rename `combine_overall` to just `combine` to look a little cleaner.
combine = defaultdict(lambda: defaultdict(list))
for k, v in intersect[0].items():
    for key, value in v.items():
        combine[k][key] = [value]

for to_merge in intersect[1:]:
    for k, v in combine.items():
        for key in set(list(combine[k].keys()) + list(to_merge[k].keys())):
            combine[k][key] = combine[k].get(key, [99999]) + [to_merge[k].get(key, 99999)]
5. Oops, there was a bug all along. Also, "special cases aren't special enough to break the rules"
Hopefully, this looks a little strange to you. Why are we using `.get` on a defaultdict? Why would we have this single-item placeholder value, rather than an empty list? Why do we have to do this complicated check for the possible `key`s to use? And do we really need to handle the first `intersect` value differently?
Consider what happens on the following data (using the original naming conventions):
intersect_1 = {'1': {'1': 1}}
intersect_2 = {'1': {'1': 1}}
intersect_3 = {'1': {'1': 1, '2': 1}}
With the original approach, I get a result like:
$ python -i tmp.py
>>> intersect_3
defaultdict(<class 'list'>, {'1': defaultdict(<class 'list'>, {'2': [99999, 1], '1': [1, 1, 1]})})
Oops. `intersect_3['1']['2']` only has two elements (`[99999, 1]`), and thus doesn't match up with `intersect_3['1']['1']`. That defeats the purpose of the `99999` placeholder values. The problem is that, because the value was missing multiple times at the start, multiple `99999` values should have been inserted, but only one was - the one that came from creating the initial list, when the `isinstance` check reported an `int` rather than a `list`, because it retrieved the `99999` placeholder with `.get`. That lost information: we couldn't distinguish between a missing int and a missing list.
How do we work around this? Simple: we use the same, overall key set each time - the total set of keys that should be present, which we get from `grid_density_original[k]`. Whenever one of those entries is missing in any of the `intersect` results, we write the placeholder value instead. Now we are also handling each `intersect` result the same way - instead of doing special setup with the first value and merging everything else in, we merge everything into an empty initial state.
Instead of iterating over the `.items` of `combine` (and expecting `to_merge` to have the same keys), we iterate over `to_merge`, which makes a lot more sense. Instead of creating and assigning a list for `combine[k][key]`, we simply append a value to the existing list (which we know is available, because we are using `defaultdict`s properly now).

Thus:
combine = defaultdict(lambda: defaultdict(list))
for to_merge in intersect:
    for k, v in to_merge.items():
        # BTW, you typo'd this as "orignal" in the original code
        for key in grid_density_original[k].keys():
            combine[k][key].append(v.get(key, 99999))
(This does mean that, if none of the `intersect` dicts contain a particular `key`, the result will contain a list of 1000 `99999` values, instead of omitting the key completely. I hope that isn't an issue.)
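As a quick sanity check, here is the merge run over just the two example grids from the question (the data is copied from the question; the asserts reflect the expected combine12, plus the all-placeholder entries just mentioned):

from collections import defaultdict

grid_density_original = {'143741': {'467457': 1, '501089': 2, '903718': 1, '999216': 5, '9990': 4},
                         '281092': {'1434': 60, '3345': 3, '9991': 43}, '33123': {'56788': 4}}
# The two intersect results from the question's example (intersect_1 and intersect_2).
intersect = [
    {'143741': {'467457': 1, '501089': 2, '903718': 1, '999216': 5},
     '281092': {'1434': 67, '3345': 345}, '33123': {'56788': 45}},
    {'143741': {'467457': 5, '501089': 7},
     '281092': {'1434': 67, '3345': 20}, '33123': {'56788': 38}},
]

combine = defaultdict(lambda: defaultdict(list))
for to_merge in intersect:
    for k, v in to_merge.items():
        for key in grid_density_original[k].keys():
            combine[k][key].append(v.get(key, 99999))

assert combine['33123'] == {'56788': [45, 38]}
assert combine['143741']['903718'] == [1, 99999]
assert combine['143741']['9990'] == [99999, 99999]  # key present only in the reference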
6. Okay, but weren't you going to do something about the memory usage?
Oh, right. Take a moment to write the corresponding code for the first loop, please.
Got it? Okay, now: set up `combine` first; and then, each time you compute one of the would-be elements of `intersect`, merge it in (using the two inner loops shown here) instead of building the actual `intersect` list.
Oh, and I think you'll find that - since we're going to iterate over `grid_density_original[k].keys()` anyway - the preprocessing to remove other keys from the `g_den` results isn't actually needed at all now.
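If you want something to check your version against, here is one way the finished, streaming version might come out (a sketch using the file layout and `latin-1` encoding from the question; iterating over the reference's keys at both levels keeps every output list at exactly 1000 elements, even if a grid is missing a top-level key):

from collections import defaultdict
import pickle

with open('path\grid_density_original_intercountry.pickle', 'rb') as handle:
    grid_density_original = pickle.load(handle, encoding='latin-1')

combine = defaultdict(lambda: defaultdict(list))
for iteration in range(1, 1001):
    # Load one grid at a time; nothing else needs to stay in memory between iterations.
    with open('path\grid_%s' % iteration, 'rb') as handle:
        grid = pickle.load(handle, encoding='latin-1')
    # Merge it in immediately - no intersect list, no Counter preprocessing.
    for k, reference_keys in grid_density_original.items():
        v = grid.get(k, {})
        for key in reference_keys:
            combine[k][key].append(v.get(key, 99999))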