How do you speed up a score calculation based on two rows in a Pandas DataFrame?

TL;DR: How can I adjust the for-loop below so that it runs faster?

import numpy as np
import pandas as pd
import time

np.random.seed(0)

# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5

start = time.time()

target_row = df.loc[target_row_index]
result = []

# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score: 
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1] + target_row  # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3) # np.sum takes 47 microseconds per call
    result.append(score)

print(time.time()-start)

# Goal: Calculate the list result as efficiently as possible

# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)
    
start = time.time()
q = df.apply(lambda row : add(row, target_row), axis = 1)
print(time.time()-start)

So I have a DataFrame with 30,000 rows and a target row in it, identified by a given row index. I want to compare this target row to every row in the DataFrame by calculating a score. The score is calculated as follows:

if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.

The result is then the list of all the scores we just calculated.
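
To make the rule concrete, here is a minimal sketch for a single pair of rows. The two 6-column arrays are made up for illustration. Because every value is 0, 1 or 2, an element-wise sum of 4 can only come from 2 + 2, and a sum of 3 can only come from 1 + 2:

```python
import numpy as np

# Two hypothetical 6-column rows with values in {0, 1, 2}
row = np.array([2, 2, 1, 0, 2, 1])
target = np.array([2, 1, 2, 2, 0, 1])

check = row + target              # element-wise sums: [4, 3, 3, 2, 2, 2]
both_two = np.sum(check == 4)     # columns where both rows hold 2
one_and_two = np.sum(check == 3)  # columns where one row holds 1 and the other 2
score = both_two - one_and_two

print(score)  # -1 (one matching pair of 2s, two 1-vs-2 pairs)
```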

As I need to execute this code quite often I would like to optimize it for performance. Any help is very much appreciated.

I have already read "Optimization when using Pandas". Are there further resources you can recommend? Thanks.

Answer:

If you're willing to convert your df to a NumPy array, NumPy's vectorised operations help a lot here. My code using NumPy is below:

import numpy as np
import pandas as pd
import time

df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5

start_time = time.time()

# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()

# Calculations
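# Add the target row to every row (broadcasting). An out-of-place add is used
# because to_numpy() may return a view, and += would then modify df itself.
np_arr = np_arr + target_row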
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)

end_time = time.time()
print(end_time - start_time)

On Google Colab, your complete code takes 14.875332832336426 s for me, while the NumPy code above takes 0.018691539764404297 s, and the result list is the same in both cases.
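
If you want to convince yourself that the two approaches agree, a small check along these lines should do. This is only a sketch: the 200-row frame and the names small_df, loop_result and vec_result are made up for the test, and the positional lookup arr[target_row_index] assumes the default RangeIndex, where label 5 is also position 5.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
small_df = pd.DataFrame(rng.integers(0, 3, size=(200, 50)))
target_row_index = 5
target_row = small_df.loc[target_row_index]

# Row-by-row version, same logic as the loop in the question
loop_result = []
for _, row in small_df.iterrows():
    check = row + target_row
    loop_result.append(int(np.sum(check == 4) - np.sum(check == 3)))

# Vectorised version: broadcast the target row against the whole array
arr = small_df.to_numpy()
summed = arr + arr[target_row_index]  # new array, small_df is left untouched
vec_result = (np.sum(summed == 4, axis=1) - np.sum(summed == 3, axis=1)).tolist()

print(loop_result == vec_result)  # True
```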

Note that in general, if your calculations are purely numerical, NumPy will virtually always be faster than Pandas with a for-loop. Pandas really shines with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go because of vectorisation.
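
One caveat: the sum trick (== 4 and == 3) only works because the values are limited to 0, 1 and 2. If your data could contain other values, a variant with explicit boolean masks is safer. The helper name pair_scores below is hypothetical and this is just a sketch of the same vectorised idea, not a tuned implementation:

```python
import numpy as np

def pair_scores(values, target):
    """Score every row of `values` against `target` (hypothetical helper)."""
    values = np.asarray(values)
    target = np.asarray(target)
    # +1 for each column where both rows hold 2
    both_two = (values == 2) & (target == 2)
    # -1 for each column where one row holds 1 and the other holds 2
    one_and_two = ((values == 1) & (target == 2)) | ((values == 2) & (target == 1))
    return both_two.sum(axis=1) - one_and_two.sum(axis=1)

rng = np.random.default_rng(0)
data = rng.integers(0, 3, size=(1000, 50))
scores = pair_scores(data, data[5])

# For values in {0, 1, 2} this matches the sum-based calculation
summed = data + data[5]
expected = np.sum(summed == 4, axis=1) - np.sum(summed == 3, axis=1)
print(np.array_equal(scores, expected))  # True
```

With a DataFrame you would call it as pair_scores(df.to_numpy(), df.loc[target_row_index].to_numpy()).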