Given the following table with one column storing a unique identifier (user_id column) and four binary columns (col1 to col_4_):
import pandas as pd df = pd.DataFrame.from_dict({ 'id': ['a', 'b', 'c', 'd', 'e'] ,'col1': [1,1,0,1,0] ,'col2': [0,1,1,1,1] ,'col3': [0,0,1,0,0] ,'col4': [0,0,1,1,1] })
user_id | col1 | col2 | col3 | col4 |
---|---|---|---|---|
a | 1 | 0 | 0 | 0 |
b | 1 | 1 | 0 | 0 |
c | 0 | 1 | 1 | 1 |
d | 1 | 1 | 0 | 1 |
e | 0 | 1 | 0 | 1 |
How can I create an output table that shows how many user_ids had co-occurrence pairs of the binary columns?
co-occurrence pair | count of user_id |
---|---|
col1-col2 | 2 |
col1-col3 | 0 |
col1-col4 | 1 |
col2-col3 | 1 |
col2-col4 | 3 |
col3-col4 | 1 |
CodePudding user response:
You can try something like this:
from itertools import combinations
import pandas as pd
df = pd.DataFrame.from_dict({
'id': ['a', 'b', 'c', 'd', 'e']
,'col1': [1,1,0,1,0]
,'col2': [0,1,1,1,1]
,'col3': [0,0,1,0,0]
,'col4': [0,0,1,1,1]
})
dfi = df.set_index('id')
pd.Series({f'{z[0]}-{z[1]}':df[list(z)].all(1).sum() for z in list(combinations(dfi.columns, 2))})
Output:
col1-col2 2
col1-col3 0
col1-col4 1
col2-col3 1
col2-col4 3
col3-col4 1
dtype: int64
CodePudding user response:
df_output = pd.DataFrame({"co-occurrence pair": [f"{df.columns[i]}-{df.columns[j]}" for i in range(1, len(df.columns)) for j in range(i 1, len(df.columns))],
"count of user_id": [sum(df.iloc[:, i] & df.iloc[:, j]) for i in range(1, len(df.columns)) for j in range(i 1, len(df.columns))]})
co-occurrence pair count of user_id
0 col1-col2 2
1 col1-col3 0
2 col1-col4 1
3 col2-col3 1
4 col2-col4 3
5 col3-col4 1
More readable version
cols = []
counts = []
for i in range(1, len(df.columns)):
for j in range(i 1, len(df.columns)):
cols.append(f"{df.columns[i]}-{df.columns[j]}")
counts.append(sum(df.iloc[:, i] & df.iloc[:, j]))
df_output = pd.DataFrame({"co-occurrence pair": cols, "count of user_id": counts})
CodePudding user response:
You can use itertools APIs for this.
Filter required columns:
import itertools
columns = list(itertools.filterfalse(lambda c: c=="id", df.columns))
>> ['col1', 'col2', 'col3', 'col4']
Create combinations of column pairs:
combinations = list(itertools.combinations(columns, 2))
>> [('col1', 'col2'),
>> ('col1', 'col3'),
>> ('col1', 'col4'),
>> ('col2', 'col3'),
>> ('col2', 'col4'),
>> ('col3', 'col4')]
Filter binary columns:
result_data = [(f"{x}-{y}", df[(df[x]==1) & (df[y]==1)]["id"].count()) for x,y in combinations]
result_df = pd.DataFrame(result_data, columns=["co-occurrence pair", "count of user_id"])
>> [('col1-col2', 2),
>> ('col1-col3', 0),
>> ('col1-col4', 1),
>> ('col2-col3', 1),
>> ('col2-col4', 3),
>> ('col3-col4', 1)]