Counting pair value across several columns?-CodePudding

Given the following table with one column storing a unique identifier (user_id column) and four binary columns (col1 to col_4_):

    import pandas as pd
    df = pd.DataFrame.from_dict({
        'id': ['a', 'b', 'c', 'd', 'e']
        ,'col1': [1,1,0,1,0]
        ,'col2': [0,1,1,1,1]
        ,'col3': [0,0,1,0,0]
        ,'col4': [0,0,1,1,1]
    })

user_id	col1	col2	col3	col4
a	1	0	0	0
b	1	1	0	0
c	0	1	1	1
d	1	1	0	1
e	0	1	0	1

How can I create an output table that shows how many user_ids had co-occurrence pairs of the binary columns?

co-occurrence pair	count of user_id
col1-col2	2
col1-col3	0
col1-col4	1
col2-col3	1
col2-col4	3
col3-col4	1

CodePudding user response：

You can try something like this:

from itertools import combinations
import pandas as pd

df = pd.DataFrame.from_dict({
        'id': ['a', 'b', 'c', 'd', 'e']
        ,'col1': [1,1,0,1,0]
        ,'col2': [0,1,1,1,1]
        ,'col3': [0,0,1,0,0]
        ,'col4': [0,0,1,1,1]
    })

dfi = df.set_index('id')

pd.Series({f'{z[0]}-{z[1]}':df[list(z)].all(1).sum() for z in list(combinations(dfi.columns, 2))})

Output:

col1-col2    2
col1-col3    0
col1-col4    1
col2-col3    1
col2-col4    3
col3-col4    1
dtype: int64

CodePudding user response：

df_output = pd.DataFrame({"co-occurrence pair": [f"{df.columns[i]}-{df.columns[j]}" for i in range(1, len(df.columns)) for j in range(i 1, len(df.columns))],
              "count of user_id": [sum(df.iloc[:, i] & df.iloc[:, j]) for i in range(1, len(df.columns)) for j in range(i 1, len(df.columns))]})

co-occurrence pair  count of user_id
0          col1-col2                 2
1          col1-col3                 0
2          col1-col4                 1
3          col2-col3                 1
4          col2-col4                 3
5          col3-col4                 1

More readable version

cols = []
counts = []
for i in range(1, len(df.columns)):
    for j in range(i   1, len(df.columns)):
        cols.append(f"{df.columns[i]}-{df.columns[j]}")
        counts.append(sum(df.iloc[:, i] & df.iloc[:, j]))
df_output = pd.DataFrame({"co-occurrence pair": cols, "count of user_id": counts})

CodePudding user response：

You can use itertools APIs for this.

Filter required columns:

import itertools

columns = list(itertools.filterfalse(lambda c: c=="id", df.columns))
>> ['col1', 'col2', 'col3', 'col4']

Create combinations of column pairs:

combinations = list(itertools.combinations(columns, 2))
>> [('col1', 'col2'),
>>  ('col1', 'col3'),
>>  ('col1', 'col4'),
>>  ('col2', 'col3'),
>>  ('col2', 'col4'),
>>  ('col3', 'col4')]

Filter binary columns:

result_data = [(f"{x}-{y}", df[(df[x]==1) & (df[y]==1)]["id"].count()) for x,y in combinations]

result_df = pd.DataFrame(result_data, columns=["co-occurrence pair", "count of user_id"])

>> [('col1-col2', 2),
>>  ('col1-col3', 0),
>>  ('col1-col4', 1),
>>  ('col2-col3', 1),
>>  ('col2-col4', 3),
>>  ('col3-col4', 1)]