Matrix multiplication on a pandas dataframe

Time:01-26

I have the following dataframe

import numpy as np
import pandas as pd

data = [[30, 19, 6], [12, 23, 14], [8, 18, 20]]

df = pd.DataFrame(data = data, index = ['A', 'B', 'C'], columns = ['Bulgary', 'Robbery', 'Car Theft'])
df['r_Total'] = df.iloc[:, :3].sum(axis=1)
df.loc['c_Total'] = df.iloc[:3, :].sum(axis=0)
df
         Bulgary  Robbery  Car Theft  r_Total
A             30       19          6       55
B             12       23         14       49
C              8       18         20       46
c_Total       50       60         40      150

Now, I want to obtain expectation matrix E, where

E_{ij} = (ith_r_Total/Total)(jth_c_Total/Total)*Total

(e.g., E_{00} = (55/150)*(50/150)*150.) How should I accomplish this?

Of course, there's always a brute force method to iterate over i and j in two for-loops, but is there a smarter way of doing it?

Mathematically, r_Total * c_Total / 150 would give me the answer I want. It really boils down to how to do matrix multiplication on dataframe. I tried:

df['r_Total'].dot(df.loc['c_Total'])

But got the following error:

ValueError: matrices are not aligned

I also tried:

df['r_Total'].mul(df.loc['c_Total'])

But got

A                NaN
B                NaN
Bulgary          NaN
C                NaN
Car Theft        NaN
Robbery          NaN
Total        22500.0
Name: Total, dtype: float64

But this is not my expected output. I am multiplying a single column by a single row, so my expected output should be a 3x3 matrix.

Answer:

To perform a matrix multiplication, the inner dimensions have to match. Your expected output has shape (3, 3), but the two operands are a single column and a single row, so you'll have to select them as 2-D objects of shape (3, 1) and (1, 3) for the product to have the desired shape. You can do that by using double brackets ([[]]) when selecting.
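As a quick check of the shapes involved, here is a minimal sketch that rebuilds the frame from the question and compares single-bracket and double-bracket selection. The names `series_shape`, `col` and `row` are only illustrative.

```python
import pandas as pd

# Rebuild the frame from the question, including both total lines.
data = [[30, 19, 6], [12, 23, 14], [8, 18, 20]]
df = pd.DataFrame(data, index=['A', 'B', 'C'],
                  columns=['Bulgary', 'Robbery', 'Car Theft'])
df['r_Total'] = df.iloc[:, :3].sum(axis=1)
df.loc['c_Total'] = df.iloc[:3, :].sum(axis=0)

# Single brackets return a 1-D Series; double brackets keep a 2-D DataFrame.
series_shape = df['r_Total'].shape            # 1-D, still includes c_Total
col = df.loc[df.index[:-1], ['r_Total']]      # row totals as a column
row = df.loc[['c_Total'], df.columns[:-1]]    # column totals as a row

print(series_shape, col.shape, row.shape)     # (4,) (3, 1) (1, 3)
```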

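To make the multiplication work, you'll have to convert at least one of the dataframes into a numpy array (that's what .values does). Otherwise pandas tries to match the column labels of the left operand against the index labels of the right one and raises the same "matrices are not aligned" error.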

Also, to exclude the "_Total" rows, only select up to the last element.

Then divide by the last element in the dataframe which is 150 in your example.

res = df.loc[df.index[:-1], ['r_Total']].dot(df.loc[['c_Total']].values[:, :-1]).div(df.iat[-1, -1])

res
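Because one operand is a bare numpy array, the columns of the result are numbered 0, 1, 2 rather than named. If keeping the column labels matters, one alternative sketch uses `np.outer` on the two totals and wraps the result back into a DataFrame. This is a different technique from the `.dot` call above, and the names `row_totals`, `col_totals`, `grand_total` and `expected` are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question, including both total lines.
data = [[30, 19, 6], [12, 23, 14], [8, 18, 20]]
df = pd.DataFrame(data, index=['A', 'B', 'C'],
                  columns=['Bulgary', 'Robbery', 'Car Theft'])
df['r_Total'] = df.iloc[:, :3].sum(axis=1)
df.loc['c_Total'] = df.iloc[:3, :].sum(axis=0)

row_totals = df['r_Total'].iloc[:-1]       # totals for A, B, C
col_totals = df.loc['c_Total'].iloc[:-1]   # totals for the three crime columns
grand_total = df.iat[-1, -1]               # bottom-right cell (150 here)

# Outer product: E[i, j] = row_totals[i] * col_totals[j] / grand_total
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / grand_total,
    index=row_totals.index,
    columns=col_totals.index,
)
print(expected.round(2))
```

A handy sanity check is that the entries of `expected` add up to the grand total again.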
