I have a pandas dataframe with a column that contains dictionaries. I also have a query dictionary, and for each row I want to compute the minimum of the two sums of values over the keys that the row's dictionary and the query dictionary have in common.
For example
dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
common keys = 'a', 'b'
s1 = dicta['a'] + dicta['b'] = 5 + 21 = 26
s2 = dictb['a'] + dictb['b'] = 1 + 1 = 2
result = min(s1, s2) = min(26, 2) = 2
I am using the following code to compute it.
def compute_common(dict1, dict2):
    # keys present in both dictionaries
    common_keys = dict1.keys() & dict2.keys()
    # sum of the values at the common keys in each dictionary
    im_count1 = sum(dict1[k] for k in common_keys)
    im_count2 = sum(dict2[k] for k in common_keys)
    return int(min(im_count1, im_count2))
Following are the timings for this operation on my 8-core i7 machine with 8 GB of RAM.
%timeit df['a'].apply(lambda x:compute_common(dictb, x))
55.2 ms ± 702 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I also found out that I can use swifter to improve the performance of pandas apply (it uses multiprocessing internally):
%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))
66.4 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using swifter is even slower (maybe because of the overhead of multiprocessing). I wanted to know if there is any way to squeeze more performance out of this operation.
You can use the following to replicate things.
import pandas as pd

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}

# 30,000 rows, each holding the same dictionary
df = pd.DataFrame({'a': [dicta] * 30000})

%timeit df['a'].apply(lambda x: compute_common(dictb, x))
%timeit df['a'].swifter.progress_bar(False).apply(lambda x: compute_common(dictb, x))
Thanks in advance.
CodePudding user response:
Use a list comprehension to find the common keys, sum the values of those keys in each dictionary, and take the minimum of the two sums. Here common_keys comes out as ['a', 'b']; the generator expressions then sum the corresponding values, giving 26 for dicta and 2 for dictb, and min(26, 2) is 2.
def find_common_keys(dicta, dictb):
    '''
    >>> find_common_keys({'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67},
    ...                  {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67})
    2
    '''
    # keys present in both dictionaries
    common_keys = [key for key in dicta if key in dictb]
    s1 = sum(dicta[key] for key in common_keys)
    s2 = sum(dictb[key] for key in common_keys)
    return min(s1, s2)
dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
print(find_common_keys(dicta,dictb))
output
2
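For reference, here is a minimal sketch (assuming the same df built in the question) of how this function plugs into the original apply call; the variable name result is just illustrative:

import pandas as pd

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})

# compare each row's dictionary against the query dictionary
result = df['a'].apply(lambda x: find_common_keys(x, dictb))
print(result.head())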
CodePudding user response:
You can explode the dictionaries into a dataframe and sum over its columns:
dict_data = pd.DataFrame(df['a'].tolist())
common_keys = dict_data.columns.intersection(dictb.keys())
dictb_sum = sum(dictb[k] for k in common_keys)
dicta_sum = dict_data[common_keys].sum(1)
# element-wise minimum of the row sums and the scalar query sum
output = dicta_sum.clip(upper=dictb_sum)
This is twice as fast as apply on my system. Note that this works when the union of all the keys in df['a'] is not too big, since every key in that union becomes a column of dict_data, but the frame should be large enough that you benefit from the vectorized .sum(1).
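Putting the pieces together, a minimal end-to-end sketch of this vectorized approach (reusing dicta, dictb, and df from the question; the int cast is only there to match the question's output type) could look like this:

import pandas as pd

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})

# one column per key; NaN where a row's dictionary lacks that key
dict_data = pd.DataFrame(df['a'].tolist())

# keys shared between the expanded columns and the query dictionary
common_keys = dict_data.columns.intersection(dictb.keys())

# scalar sum for the query dict, vectorized row sums for the column dicts
dictb_sum = sum(dictb[k] for k in common_keys)
dicta_sum = dict_data[common_keys].sum(1)

# element-wise minimum of the two sums
output = dicta_sum.clip(upper=dictb_sum).astype(int)
print(output.head())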