So I have a one-hot encoding called datBaskets. Rows are transactions, and columns are store departments. The ijth entry is true if the ith basket contains an item from the jth department and false otherwise. It looks something like this...
Cleaning Supplies | Batteries | Food | |
---|---|---|---|
0 | True | True | False |
1 | False | True | True |
2 | False | False | False |
3 | True | False | False |
4 | True | False | True |
5 | False | True | False |
What I seek is a matrix with departments as both rows and columns that tells me the proportion of all transactions that contain that combination of departments. So the Batteries-Food entry would be 0.167 in this case.
CodePudding user response:
Here is a vectorized solution, which is much faster than looping.
First, convert the boolean entries to integers (df.astype(int)
). Then compute the matrix multiplication between df
transposed (df.T
) and df
using the @
operator (df.T.dot(df)
works too). Finally, normalize the result to the number of transactions (len(df)
).
Optionally, round the result to 3 decimal places using the DataFrame.round
method.
df = df.astype(int)
res = ((df.T @ df) / len(df)).round(3)
>>> res
Cleaning Supplies Batteries Food
Cleaning Supplies 0.500 0.167 0.167
Batteries 0.167 0.500 0.167
Food 0.167 0.167 0.333
Intermediate Results
>>> df.astype(int)
Cleaning Supplies Batteries Food
0 1 1 0
1 0 1 1
2 0 0 0
3 1 0 0
4 1 0 1
5 0 1 0
>>> df.T
0 1 2 3 4 5
Cleaning Supplies 1 0 0 1 1 0
Batteries 1 1 0 0 0 1
Food 0 1 0 0 1 0
>>> df.T @ df
Cleaning Supplies Batteries Food
Cleaning Supplies 3 1 1
Batteries 1 3 1
Food 1 1 2
>>> (df.T @ df) / len(df)
Cleaning Supplies Batteries Food
Cleaning Supplies 0.500000 0.166667 0.166667
Batteries 0.166667 0.500000 0.166667
Food 0.166667 0.166667 0.333333
CodePudding user response:
from io import StringIO
import numpy as np
import pandas as pd
s = """
Cleaning Supplies,Batteries,Food
True,True,False
False,True,True
False,False,False
True,False,False
True,False,True
False,True,False
"""
# read in your data
df = pd.read_csv(StringIO(s))
ncols = len(df.columns)
matrix = np.zeros((ncols, ncols)) # initialize empty array
# go through each point in matrix and assign value (this will take the % of True, True combinations)
for i, col1 in enumerate(df.columns):
for j, col2 in enumerate(df.columns):
matrix[i, j] = (df[col1] * df[col2]).mean()
df_matrix = pd.DataFrame(matrix, columns=df.columns, index=df.columns) # create a data frame that labels indices and columns
Then df_matrix
will look like this:
Cleaning Supplies Batteries Food
Cleaning Supplies 0.500000 0.166667 0.166667
Batteries 0.166667 0.500000 0.166667
Food 0.166667 0.166667 0.333333