Consider the following dataframe:
import pandas as pd
data = [[1, 2, 3, 4], [4, 3, 2, 1]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
What would be the most efficient way to generate an expected frequency table from it? That is, for each cell, compute (row total * column total) / (grand total).
The final dataframe should be:
data = [[2.5, 2.5, 2.5, 2.5], [2.5, 2.5, 2.5, 2.5]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
CodePudding user response:
You can use the underlying numpy array and broadcasting:
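a = df.values  # underlying 2-D numpy array
# a.sum(0): column totals, shape (n_cols,)
# a.sum(1)[:, None]: row totals as a column vector, shape (n_rows, 1)
# their product broadcasts to (n_rows, n_cols); a.sum() is the grand total
pd.DataFrame((a.sum(0) * a.sum(1)[:, None]) / a.sum(),
             columns=df.columns, index=df.index)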
output:
     A    B    C    D
0  2.5  2.5  2.5  2.5
1  2.5  2.5  2.5  2.5
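For reuse, the same computation could be wrapped in a small function. The sketch below is one possible way to do it, not the only one: the name `expected_frequencies` is hypothetical, and it swaps the explicit broadcasting for `np.outer`, which gives the same row-total by column-total product.

```python
import numpy as np
import pandas as pd


def expected_frequencies(df: pd.DataFrame) -> pd.DataFrame:
    """Return (row total * column total) / grand total for every cell.

    Hypothetical helper; assumes df holds only numeric counts.
    """
    row_totals = df.sum(axis=1).to_numpy()  # one total per row
    col_totals = df.sum(axis=0).to_numpy()  # one total per column
    grand_total = row_totals.sum()
    # np.outer builds the (n_rows, n_cols) grid of row_total * col_total
    expected = np.outer(row_totals, col_totals) / grand_total
    return pd.DataFrame(expected, index=df.index, columns=df.columns)


df = pd.DataFrame([[1, 2, 3, 4], [4, 3, 2, 1]], columns=["A", "B", "C", "D"])
result = expected_frequencies(df)
print(result)
```

With the dataframe from the question, every row total is 10, every column total is 5, and the grand total is 20, so each cell comes out as 10 * 5 / 20 = 2.5.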