Home > Back-end >  Frequencies of combinaties of columns in Python
Frequencies of combinaties of columns in Python

Time:08-10

I have several columns that contain specific diseases. Here an example of a piece of it:

enter image description here

I want to make all possible combinations so I can check which combination of diseases mostly occur. So I want to make all combinations of 2 columns (A&B, A&C, A&D, B&C, B&D, C&D), but also combinations of 3 and 4 columns (A&B&C, B&C&D and so on). I have the following script for this:

from itertools import combinations

df.join(pd.concat({'_'.join(x): df[x[0]].str.cat(df[list(x[1:])].astype(str),
                                                 sep='')
                   for i in (2, 3, 4)
                   for x in combinations(df, i)}, axis=1))

But that generates a lot of extra columns in my dataset, and I still haven't got the frequencies of all combinations. This is the output that I would like to get:

enter image description here

What script can I use for this?

CodePudding user response:

Use DataFrame.stack with aggregate join and last count by Series.value_counts:

s = df.stack().groupby(level=0).agg(','.join).value_counts()
print (s)
artritis,asthma                         2
cancer,artritis,heart_failure,asthma    1
cancer,heart_failure                    1
dtype: int64   

If need 2 columns DataFrame:

df = s.rename_axis('vals').reset_index(name='count')
print (df)
                                   vals  count
0                       artritis,asthma      2
1  cancer,artritis,heart_failure,asthma      1
2                  cancer,heart_failure      1
    

CodePudding user response:

You can create a pivot table

def index_agg_fn(x):
    x = [e for e in x if e != '']
    return ','.join(x)

df = pd.DataFrame({'A': ['cancer', 'cancer', None, None], 
                  'B': ['artritis', None, 'artritis', 'artritis'], 
                  'C': ['heart_failure', 'heart_failure', None, None], 
                  'D': ['asthma', None, 'asthma', 'asthma']})
df['count'] = 1
ptable = pd.pivot_table(df.fillna(''), index=['A', 'B', 'C', 'D'], values=['count'], aggfunc='sum')
ptable.index = list(map(index_agg_fn, ptable.index))
print(ptable)

Result

                                         count
artritis,asthma                            2
cancer,heart_failure                       1
cancer,artritis,heart_failure,asthma       1
  • Related