Support Matrix for Boolean DataFrame

I have a one-hot encoded DataFrame called datBaskets. Rows are transactions and columns are store departments. Entry (i, j) is True if the ith basket contains an item from the jth department and False otherwise. It looks something like this:

   Cleaning Supplies  Batteries   Food
0               True       True  False
1              False       True   True
2              False      False  False
3               True      False  False
4               True      False   True
5              False       True  False

What I want is a matrix with departments as both rows and columns, where each entry is the proportion of all transactions that contain both departments. For example, the Batteries-Food entry would be 0.167 here, because one of the six baskets contains items from both.

CodePudding user response：

Here is a vectorized solution, which is much faster than looping.

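Here df is the boolean DataFrame from the question (datBaskets). First, convert the boolean entries to integers (df.astype(int)). Then multiply df transposed (df.T) by df using the @ operator (df.T.dot(df) works too); entry (i, j) of that product counts the transactions in which departments i and j both appear. Finally, divide the result by the number of transactions (len(df)) to turn the counts into proportions.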

Optionally, round the result to 3 decimal places using the DataFrame.round method.

df = df.astype(int)
res = ((df.T @ df) / len(df)).round(3)

>>> res

                   Cleaning Supplies  Batteries   Food
Cleaning Supplies              0.500      0.167  0.167
Batteries                      0.167      0.500  0.167
Food                           0.167      0.167  0.333
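
For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. The function name support_matrix and the variable dat_baskets are made up for illustration, and the DataFrame is rebuilt by hand from the sample in the question. Unlike the snippet above, it leaves the original boolean DataFrame untouched.

```python
import pandas as pd


def support_matrix(baskets: pd.DataFrame, decimals: int = 3) -> pd.DataFrame:
    """Proportion of rows in which each pair of columns is True together."""
    counts = baskets.astype(int)        # True/False -> 1/0 (returns a copy)
    co_occurrence = counts.T @ counts   # entry (i, j): rows where both i and j are 1
    return (co_occurrence / len(counts)).round(decimals)


# Sample data copied from the question (hypothetical variable name).
dat_baskets = pd.DataFrame({
    "Cleaning Supplies": [True, False, False, True, True, False],
    "Batteries": [True, True, False, False, False, True],
    "Food": [False, True, False, False, True, False],
})

res = support_matrix(dat_baskets)
print(res)
```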

Intermediate Results

>>> df.astype(int)

   Cleaning Supplies  Batteries  Food
0                  1          1     0
1                  0          1     1
2                  0          0     0
3                  1          0     0
4                  1          0     1
5                  0          1     0

>>> df.T

                   0  1  2  3  4  5
Cleaning Supplies  1  0  0  1  1  0
Batteries          1  1  0  0  0  1
Food               0  1  0  0  1  0

>>> df.T @ df

                   Cleaning Supplies  Batteries  Food
Cleaning Supplies                  3          1     1
Batteries                          1          3     1
Food                               1          1     2

>>> (df.T @ df) / len(df)

                   Cleaning Supplies  Batteries      Food
Cleaning Supplies           0.500000   0.166667  0.166667
Batteries                   0.166667   0.500000  0.166667
Food                        0.166667   0.166667  0.333333
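
Two properties of this matrix make handy sanity checks: the diagonal is just each department's own share of baskets (the column means), and the matrix is symmetric because "i with j" and "j with i" count the same baskets. The sketch below checks both on the sample data; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, already converted to 0/1 integers.
df = pd.DataFrame({
    "Cleaning Supplies": [1, 0, 0, 1, 1, 0],
    "Batteries": [1, 1, 0, 0, 0, 1],
    "Food": [0, 1, 0, 0, 1, 0],
})

support = (df.T @ df) / len(df)
values = support.to_numpy()

# Diagonal entry (i, i) is the fraction of baskets containing department i.
diagonal_matches_means = np.allclose(np.diag(values), df.mean().to_numpy())

# Entry (i, j) equals entry (j, i).
is_symmetric = np.allclose(values, values.T)

print(diagonal_matches_means, is_symmetric)
```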

CodePudding user response：

from io import StringIO

import numpy as np
import pandas as pd

s = """
Cleaning Supplies,Batteries,Food
True,True,False
False,True,True
False,False,False
True,False,False
True,False,True
False,True,False
"""

# read in your data
df = pd.read_csv(StringIO(s))

ncols = len(df.columns)
matrix = np.zeros((ncols, ncols))  # start from a zero-filled array

# for each pair of columns, store the proportion of rows where both are True
for i, col1 in enumerate(df.columns):
    for j, col2 in enumerate(df.columns):
        matrix[i, j] = (df[col1] & df[col2]).mean()

# wrap the array in a DataFrame labeled by department on both axes
df_matrix = pd.DataFrame(matrix, columns=df.columns, index=df.columns)

Then df_matrix will look like this:

                    Cleaning Supplies   Batteries   Food
Cleaning Supplies   0.500000            0.166667    0.166667
Batteries           0.166667            0.500000    0.166667
Food                0.166667            0.166667    0.333333
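
The pairwise loop and the matrix product compute the same quantity, so on any boolean DataFrame they should agree up to floating-point error. Here is a small sketch that checks this on the sample data; loop_result and matmul_result are illustrative names, not part of either answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cleaning Supplies": [True, False, False, True, True, False],
    "Batteries": [True, True, False, False, False, True],
    "Food": [False, True, False, False, True, False],
})
cols = df.columns

# Loop approach: mean of the element-wise AND for every pair of columns.
loop_result = pd.DataFrame(
    [[(df[a] & df[b]).mean() for b in cols] for a in cols],
    index=cols,
    columns=cols,
)

# Matrix-product approach: co-occurrence counts divided by the number of rows.
ints = df.astype(int)
matmul_result = (ints.T @ ints) / len(ints)

same = np.allclose(loop_result.to_numpy(), matmul_result.to_numpy())
print(same)
```

The loop makes one pass per pair of columns, while the matrix product does all pairs in a single call, which is why the vectorized version tends to scale better as the number of departments grows.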