I want to create a data frame (df_aug15_exp) based on another very large data frame (df_aug15). For each element of the original data frame, I calculate the sum of its row and the sum of its column, multiply the two sums together, and divide the product by the sum of the whole data frame, as shown below.
for h in header:
    total = df_aug15[h].sum().sum()
    for i in range(len(df_aug15[h])):
        for j in range(df_aug15[h].shape[1]):
            row_sum = df_aug15[h].iloc[i].sum()
            col_sum = df_aug15[h].iloc[:, j].sum()
            exp_val = (row_sum * col_sum) / total
            df_aug15_exp[h].iloc[i, j] = exp_val
The problem is that this method is very slow. Is there a better way, such as running things in parallel, to speed up the process? Thanks
Answer:
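You don't need the loops at all. The value for each cell is its row sum times its column sum divided by the grand total, which is just the outer product of the row sums and the column sums. You could do something like this, where df is one of your df_aug15[h] tables: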
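import pandas as pd

# df stands for one table, e.g. df_aug15[h]
col_sum = df.sum(axis=0)          # sum of each column
row_sum = df.sum(axis=1)          # sum of each row
total_sum = col_sum.sum()         # sum of the whole data frame
col_df = pd.DataFrame(col_sum)    # column sums as an (n_cols x 1) frame
row_df = pd.DataFrame(row_sum)    # row sums as an (n_rows x 1) frame
# (n_rows x 1) dot (1 x n_cols) gives row_sum[i] * col_sum[j] for every cell
new_df = row_df.dot(col_df.T)
new_df = new_df / total_sum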
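If it helps, the same idea can also be sketched with NumPy's np.outer, which skips the intermediate one-column data frames. This is only a minimal sketch: the function name expected_values and the small sample table are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd


def expected_values(df: pd.DataFrame) -> pd.DataFrame:
    """Return (row sum * column sum) / grand total for every cell of df."""
    values = df.to_numpy(dtype=float)
    row_sum = values.sum(axis=1)   # one total per row
    col_sum = values.sum(axis=0)   # one total per column
    total = values.sum()           # grand total of the whole table
    # np.outer builds the full matrix of row_sum[i] * col_sum[j] in one call.
    expected = np.outer(row_sum, col_sum) / total
    # Keep the original row and column labels on the result.
    return pd.DataFrame(expected, index=df.index, columns=df.columns)


# Hypothetical sample data standing in for one df_aug15[h] table.
sample = pd.DataFrame([[1, 2], [3, 4]], index=["r0", "r1"], columns=["c0", "c1"])
result = expected_values(sample)
print(result)
```

Assuming df_aug15 can be indexed by each h in header to get one table (as in the question's loop), you could then build the whole result with something like df_aug15_exp = {h: expected_values(df_aug15[h]) for h in header}. Since the work is done in vectorized NumPy calls, this is usually fast enough that explicit parallelism isn't needed.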