Speeding up a double loop on a pandas data frame

Time:10-04

I want to create a data frame (df_aug15_exp) based on another very large data frame (df_aug15). For each element in the original data frame, I calculate the sum of its row and the sum of its column, multiply the two together, and divide the product by the sum of the whole data frame, as shown below.

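for h in header:
    total = df_aug15[h].sum().sum()
    for i in range(len(df_aug15[h])):
        for j in range(df_aug15[h].shape[1]):
            row_sum = df_aug15[h].iloc[i].sum()
            # select column j by position rather than by label
            col_sum = df_aug15[h].iloc[:, j].sum()
            exp_val = (row_sum * col_sum) / total
            # a single .iloc[i, j] avoids chained assignment, which may write to a copy
            df_aug15_exp[h].iloc[i, j] = exp_val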

The problem is that this method is very slow. Is there a better way, perhaps by running things in parallel, to speed up the process? Thanks

CodePudding user response:

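You can avoid the inner loops entirely: compute the row sums and column sums once, then take their outer product and divide by the total. Something like this: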

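import pandas as pd

# df is one frame from the question, e.g. df_aug15[h]
col_sum = df.sum(axis=0)  # sum of each column
row_sum = df.sum(axis=1)  # sum of each row

total_sum = col_sum.sum()  # sum of the whole frame

col_df = pd.DataFrame(col_sum)
row_df = pd.DataFrame(row_sum)

# (n_rows x 1) dot (1 x n_cols) gives every row_sum * col_sum product
new_df = row_df.dot(col_df.T)
new_df = new_df / total_sum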
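
The same idea can be sketched with NumPy's outer product, wrapped in a function and applied once per header. This is only an illustration under some assumptions: the helper name expected_values is made up here, and df_aug15 is assumed to be a dict of purely numeric frames keyed by header (the question does not say how it is stored), so a tiny stand-in is used.

```python
import numpy as np
import pandas as pd


def expected_values(df: pd.DataFrame) -> pd.DataFrame:
    """Return (row sum * column sum) / grand total for every cell of df."""
    values = df.to_numpy(dtype=float)
    row_sum = values.sum(axis=1)  # one entry per row
    col_sum = values.sum(axis=0)  # one entry per column
    total = values.sum()          # sum of the whole frame
    # np.outer builds the full grid of row_sum[i] * col_sum[j] products
    expected = np.outer(row_sum, col_sum) / total
    return pd.DataFrame(expected, index=df.index, columns=df.columns)


# Hypothetical stand-in for the question's df_aug15, keyed by header
df_aug15 = {"a": pd.DataFrame([[1, 2], [3, 4]])}
df_aug15_exp = {h: expected_values(frame) for h, frame in df_aug15.items()}
```

Because each sum is computed once instead of once per cell, this may already be fast enough that no parallelism is needed. One difference to keep in mind: the NumPy sums propagate NaN, while pandas' sum() skips NaN by default, so the two versions can disagree on frames with missing values.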