Calculating pairwise overlap for multiple boolean columns in pandas dataframe

I have a pandas dataframe with multiple boolean columns, and I would like to find the pairwise overlap between all of them. By overlap I mean the proportion of rows in which both columns are 1, ignoring rows in which both are 0: something like a Jaccard score, but excluding the cases where both elements are zero.

Dataframe example:

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)), columns=list('ABCDE'))
print(df.head())

   A  B  C  D  E
0  1  1  1  1  0
1  1  0  1  1  0
2  1  1  1  1  0
3  0  0  1  1  1
4  1  1  0  1  0

Ideally, I would like a solution like this one (from the similar question "How to compute jaccard similarity from a pandas dataframe"):

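from sklearn.metrics.pairwise import pairwise_distances
# metric="jaccard" returns a distance, so subtract it from 1 to get the similarity
jac_sim = 1 - pairwise_distances(df.T.to_numpy().astype(bool), metric="jaccard")
jac_sim = pd.DataFrame(jac_sim, index=df.columns, columns=df.columns)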

The only difference is that rows where both columns are False should be excluded.

CodePudding user response：

Does something like this help?

ab = df['A'] + df['B']  # 2 = both are 1, 1 = exactly one is 1, 0 = both are 0
vcs = ab.value_counts()
prop = vcs[2] / (vcs[1] + vcs[2])  # share of overlap, ignoring rows where both are 0

print(prop)
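
To extend this to every pair of columns, one possible sketch is to loop over `itertools.combinations` and fill a square table. The helper name `pair_overlap` is made up for illustration, and the dataframe is rebuilt from the question so the snippet runs on its own:

```python
from itertools import combinations

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)), columns=list('ABCDE'))

def pair_overlap(x, y):
    # Hypothetical helper: rows where both are 1, divided by rows where at least one is 1
    s = x + y
    both = (s == 2).sum()
    either = (s >= 1).sum()
    return both / either

# Start from the identity matrix: each column overlaps fully with itself
overlap = pd.DataFrame(np.eye(len(df.columns)), index=df.columns, columns=df.columns)
for c1, c2 in combinations(df.columns, 2):
    # The measure is symmetric, so fill both triangles at once
    overlap.loc[c1, c2] = overlap.loc[c2, c1] = pair_overlap(df[c1], df[c2])

print(overlap)
```

This is fine for a handful of columns, but the Python-level loop will get slow when there are many.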

CodePudding user response：

One option is to call scipy.spatial.distance.cdist with your custom distance function:

from scipy.spatial.distance import cdist

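def f(a, b):
    # Cast to bool so that & works regardless of the dtype cdist passes in
    a, b = a.astype(bool), b.astype(bool)
    both_one = (a & b).sum()    # rows where both columns are 1
    different = (a != b).sum()  # rows where exactly one column is 1
    return 1 - different / (different + both_one)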

dists = pd.DataFrame(cdist(df.T, df.T, f), index=df.columns, columns=df.columns)
#           A         B         C         D         E
# A  1.000000  0.240000  0.380952  0.391892  0.260274
# B  0.240000  1.000000  0.323944  0.428571  0.320000
# C  0.380952  0.323944  1.000000  0.333333  0.328571
# D  0.391892  0.428571  0.333333  1.000000  0.362500
# E  0.260274  0.320000  0.328571  0.362500  1.000000
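
If calling a Python function for every pair is too slow, the same ratio can probably be computed with a matrix product instead. The following is only a sketch: the variable names are my own, and it assumes every column contains at least one 1 (otherwise the division yields NaN for that column):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)), columns=list('ABCDE'))

X = df.to_numpy().astype(int)  # rows x columns, values 0/1
both_one = X.T @ X             # [i, j] = number of rows where columns i and j are both 1
ones = X.sum(axis=0)           # number of 1s in each column
# Inclusion-exclusion: rows where at least one of the two columns is 1
either_one = ones[:, None] + ones[None, :] - both_one

overlap = pd.DataFrame(both_one / either_one, index=df.columns, columns=df.columns)
print(overlap.round(6))
```

Since `either_one` equals `different + both_one`, this is the same quantity that `f` returns. It is also the usual Jaccard similarity for boolean data: rows where both columns are 0 appear in neither the numerator nor the denominator, so they are already ignored.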