How to speed up a pandas loop?

I have a pandas dataframe named 'matrix' that looks like this:

    antecedent_sku consequent_sku similarity
0   001            002            0.3
1   001            003            0.2
2   001            004            0.1
3   001            005            0.4
4   002            001            0.4
5   002            003            0.5
6   002            004            0.1

Out of this dataframe I want to create a similarity matrix for further clustering. I do it in two steps.

Step 1: to create an empty similarity matrix ('similarity')

set_name = set(matrix['antecedent_sku'].values) 
similarity = pd.DataFrame(index = list(set_name), columns = list(set_name))

Step 2: to fill it with values from 'matrix':

for ind in tqdm(list(similarity.index)):

    for col in list(similarity.columns):

        if ind == col:
            similarity.loc[ind, col] = 1

        elif len(matrix.loc[(matrix['antecedent_sku'].values==f'{ind}') & (matrix['consequent_sku'].values==f'{col}'), 'similarity'].values) < 1:
            similarity.loc[ind, col] = 0

        else:
            similarity.loc[ind, col] = matrix.loc[(matrix['antecedent_sku'].values==f'{ind}') & (matrix['consequent_sku'].values==f'{col}'), 'similarity'].values[0]

The problem: it takes 4 hours to fill a matrix of shape (3000,3000).

The question: what am I doing wrong? Should I aim at speeding up the code with something like Cython/Numba, or does the problem lie in the architecture of my approach, so that I should use built-in functions or some other clever way to transform 'matrix' into 'similarity' instead of a double loop?

P.S. I'm running Python 3.8.7.

CodePudding user response:

Iterating over a pandas dataframe using loc is known to be very slow, and the CPython interpreter is also slow when it comes to plain loops; every pandas operation has a high overhead. The main point, however, is that you iterate over 3000x3000 elements and, for each of them, evaluate something like matrix['antecedent_sku'].values==f'{ind}', which itself scans about 3000 items. Those items are strings, an inefficient datatype here, since the processor needs to parse a variable-length UTF-8 sequence of multiple characters, and a new formatted string is built for every comparison. Since this is done twice per iteration, about 3000*3000*3000*2 = 54_000_000_000 string comparisons are performed, i.e. roughly 3000*3000*3000*2*2*3 = 324_000_000_000 characters to (inefficiently) compare! There is no chance this can be fast. Not to mention that each of the 9_000_000 iterations creates and deletes several temporary arrays and pandas objects.

The first thing to do is to reduce the amount of recomputation with some precomputation. Indeed, you can store the values of matrix['antecedent_sku'].values==f'{ind}' (as NumPy arrays, since pandas series are inefficient) in a dictionary indexed by ind, so they can be fetched quickly inside the loop. This should make that part about 3000 times faster (since there should be only 3000 distinct values of ind). Even better: you can use a groupby to do this more efficiently, as in the sketch below.
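
A minimal sketch of that idea (not the asker's code; variable names are mine): build, per antecedent, a plain dict mapping consequent to similarity once, so the inner loop becomes a dictionary lookup instead of a full scan of 'matrix'.

# Precompute one lookup per antecedent so the inner loop no longer rescans
# the whole 'matrix' dataframe (sketch; 'lookup' and 'row' are illustrative names).
lookup = {
    ant: dict(zip(grp['consequent_sku'].values, grp['similarity'].values))
    for ant, grp in matrix.groupby('antecedent_sku')
}

for ind in similarity.index:
    row = lookup.get(ind, {})
    for col in similarity.columns:
        similarity.loc[ind, col] = 1 if ind == col else row.get(col, 0)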

Moreover, you can convert the two key columns (i.e. antecedent_sku and consequent_sku) to integers to avoid many expensive string comparisons.
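
For instance (a sketch, assuming the SKUs are purely numeric strings such as '001'):

# Convert the zero-padded SKU strings to integers once, up front
matrix['antecedent_sku'] = matrix['antecedent_sku'].astype(int)
matrix['consequent_sku'] = matrix['consequent_sku'].astype(int)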

Then you can remove useless operations like matrix.loc[..., 'similarity'].values. Indeed, since you only want to know the length of the result, you can just use np.sum on the boolean NumPy mask. In fact, you can even use np.any, since all you check is whether the length is less than 1.
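
For example (sketch; ant_mask and cons_mask are hypothetical names for the precomputed boolean arrays mentioned above):

import numpy as np

# The selection only needs to be tested for emptiness, so check the combined
# boolean mask directly instead of materialising the selected values.
mask = ant_mask & cons_mask
if not np.any(mask):
    similarity.loc[ind, col] = 0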

Then you can avoid creating temporary NumPy arrays by preallocating a buffer and passing it as the output of NumPy operations. For example, you can use np.logical_and(A, B, out=your_preallocated_buffer) instead of just A & B.
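
A sketch of that, reusing one buffer across iterations (buf, ant_mask and cons_mask are illustrative names, not from the question):

import numpy as np

# Allocate the output buffer once, outside the double loop...
buf = np.empty(len(matrix), dtype=bool)

# ...then reuse it for every (ind, col) pair instead of letting `&`
# allocate a fresh temporary array each time.
np.logical_and(ant_mask, cons_mask, out=buf)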

Finally, if (and only if) all the previous steps are not enough to make the overall computation hundreds or thousands of times faster, you can use Numba, converting your dataframe to NumPy arrays first (since Numba does not support dataframes). If this is still not enough, you can use prange (instead of range) together with Numba's parallel=True flag to use multiple threads, as in the sketch below.
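
A minimal Numba sketch under those assumptions (the SKUs are assumed to have been encoded as integer positions 0..n-1 beforehand; function and variable names are mine):

from numba import njit, prange

@njit(parallel=True)
def fill_similarity(out, ant_codes, cons_codes, sims):
    # One parallel pass over the (antecedent, consequent, similarity) triples,
    # instead of scanning the whole edge list once per output cell.
    for k in prange(ant_codes.shape[0]):
        out[ant_codes[k], cons_codes[k]] = sims[k]
    # Diagonal of ones, as in the original loop
    for i in range(out.shape[0]):
        out[i, i] = 1.0
    return out

# Usage sketch, with ant_codes/cons_codes being the integer positions of the
# SKUs in the index of 'similarity' (e.g. built with Index.get_indexer):
# sim = fill_similarity(np.zeros((n, n)), ant_codes, cons_codes,
#                       matrix['similarity'].to_numpy())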

Please note that pandas is not really designed to manipulate dataframes with 3000 columns and will certainly not be very fast because of that. NumPy is better suited for manipulating matrices.
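
For completeness, here is a sketch (not taken from the answer above) of building the whole similarity matrix without any Python loop, by scattering the values into a dense NumPy array:

import numpy as np
import pandas as pd

# All SKUs appearing on either side, so every pair in 'matrix' gets a slot.
skus = pd.Index(sorted(set(matrix['antecedent_sku']) | set(matrix['consequent_sku'])))
i = skus.get_indexer(matrix['antecedent_sku'])
j = skus.get_indexer(matrix['consequent_sku'])

sim = np.zeros((len(skus), len(skus)))
sim[i, j] = matrix['similarity'].to_numpy()   # scatter the known similarities
np.fill_diagonal(sim, 1.0)                    # self-similarity of 1, as in the loop

similarity = pd.DataFrame(sim, index=skus, columns=skus)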

CodePudding user response:

Following Jerome's lead with a dictionary, I've done the following:

Step 1: to create a dictionary

matrix_dict = matrix.copy()
matrix_dict = matrix_dict.set_index(['antecedent_sku', 'consequent_sku'])['similarity'].to_dict()

matrix_dict looks like this:

{(001, 002): 0.3}

Step 2: to fill similarity with values from matrix_dict

for ind in tqdm(list(similarity.index)):

    for col in list(similarity.columns):

        if ind == col:
            similarity.loc[ind, col] = 1

        else:
            similarity.loc[ind, col] = matrix_dict.get((int(ind), int(col)))

Step 3: fillna with zeroes

similarity = similarity.fillna(0)

Result: roughly a 35x speedup (from 4 hours 20 minutes down to 7 minutes).
