Improving pandas apply function performance


I have a pandas DataFrame with a column that contains dictionaries. I also have a query dictionary, and I want to compute the minimum of the sums of the values of the keys the two dictionaries have in common.
For example:

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
common keys = 'a', 'b'
s1 = dicta['a'] + dicta['b'] = 5 + 21 = 26
s2 = dictb['a'] + dictb['b'] = 1 + 1 = 2
result = min(s1, s2) = min(26, 2) = 2

I am using the following code to compute it.

def compute_common(dict1, dict2):
    # sum the values of the common keys in each dict and return the smaller sum
    common_keys = dict1.keys() & dict2.keys()
    im_count1 = sum(dict1[k] for k in common_keys)
    im_count2 = sum(dict2[k] for k in common_keys)
    return int(min(im_count1, im_count2))
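
As a quick sanity check, running the function on the example dictionaries above (nothing new here, just the numbers worked out earlier):

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
print(compute_common(dictb, dicta))  # common keys 'a' and 'b' -> min(26, 2) -> 2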

Following are the timings for this operation on my 8-core i7 machine with 8 GB of RAM.

%timeit df['a'].apply(lambda x:compute_common(dictb, x))
55.2 ms ± 702 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I also found out that I can use swifter to improve the performance of pandas apply (it uses multiprocessing internally):

%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))
66.4 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using swifter is even slower (maybe because of the overhead of multiprocessing). I wanted to know if there is any way to squeeze more performance out of this operation.

You can use the following to replicate things.

import pandas as pd

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})

%timeit df['a'].apply(lambda x:compute_common(dictb, x))
%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))

Thanks in advance.

CodePudding user response:

Use a list comprehension to find the keys common to both dictionaries, then sum the values of those keys in each dictionary and take the minimum of the two sums. Here the common keys are ['a', 'b'], the two sums are 26 and 2, and min(26, 2) is 2.

def find_common_keys(dicta, dictb):
    '''
    >>> find_common_keys({'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67},
    ...                  {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67})
    2
    '''
    common_keys = [key for key in dicta if key in dictb]

    s1 = sum(dicta[key] for key in common_keys)
    s2 = sum(dictb[key] for key in common_keys)
    return min(s1, s2)

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}

print(find_common_keys(dicta,dictb))

output

2

CodePudding user response:

You can explode the dictionaries into a DataFrame and sum the common-key columns row-wise:

# one row per dictionary, one column per key (NaN where a key is missing)
dict_data = pd.DataFrame(df['a'].tolist())

# keys shared with the query dictionary
common_keys = dict_data.columns.intersection(dictb.keys())

# scalar: sum of dictb's values over the common keys
dictb_sum = sum(dictb[k] for k in common_keys)

# per-row sum of each row's values over the common keys
dicta_sum = dict_data[common_keys].sum(axis=1)

# element-wise min(dicta_sum, dictb_sum)
output = dicta_sum.clip(upper=dictb_sum)

This is about twice as fast as apply on my system. Note that this works if union(x.keys() for x in df['a']) is not too big, since all those keys become columns of dict_data, but the frame is large enough that the vectorized .sum(axis=1) pays off.
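
For completeness, a minimal end-to-end sketch of this vectorized approach (assuming pandas is installed and compute_common from the question is in scope), with a check that it matches the per-row apply result:

import pandas as pd

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})

# vectorized: one column per key, row-wise sum over the common keys, then clip
dict_data = pd.DataFrame(df['a'].tolist())
common_keys = dict_data.columns.intersection(dictb.keys())
dictb_sum = sum(dictb[k] for k in common_keys)
vectorized = dict_data[common_keys].sum(axis=1).clip(upper=dictb_sum)

# per-row baseline from the question, for comparison
baseline = df['a'].apply(lambda x: compute_common(dictb, x))

print((vectorized == baseline).all())  # expected: True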
