count number of unique combinations pandas data frame-CodePudding

I have trouble creating a summary statistic for my data. My dataframe looks like this

id   status
a    approved
a    approved
b    draft
b    redraft
c    redraft
c    draft
d    approved
d    draft

outcome:

status_combo       id_count
approved,approved  1
draft,redraft      2
approved,draft     1

The code I'm using

df1=df.groupby('id')['status'].apply(tuple).rename('status_combo')
df2=df1.groupby(df1).size().reset_index(name='id_count')
print(df2)

create all combination of status where the order of status is included too however for me the desired outcome should treat draft,redraft and redraft,draft as one type of status_combo Please advise. Thanks

CodePudding user response：

You can try sort the column before making it into tuple

df1 = df.groupby('id')['status'].apply(lambda x: tuple(sorted(x))).rename('status_combo')
df2 = df1.groupby(df1).size().reset_index(name='id_count')

print(df2)

           status_combo  id_count
0  (approved, approved)         1
1     (approved, draft)         1
2      (draft, redraft)         2

CodePudding user response：

You can do :

df = df.groupby('id',as_index=False).agg(
    status_approved=('status',lambda x:','.join(sorted(tuple(x))))).groupby(
    'status_approved', as_index=False).agg(id_count=('id', 'count'))

print(df):

     status_approved  id_count
0  approved,approved         1
1     approved,draft         1
2      draft,redraft         2

CodePudding user response：

Can be done with a simple oneliner:

df.groupby('id').agg(set).reset_index().status.value_counts()

Result:

{redraft, draft}     2
{approved}           1
{draft, approved}    1
Name: status, dtype: int64

Or in your solution add sort_values('status'):

df1=df.sort_values('status').groupby('id')['status'].apply(tuple).rename('status_combo')
df2=df1.groupby(df1).size().reset_index(name='id_count')
print(df2)

Result:

           status_combo  id_count
0  (approved, approved)         1
1     (approved, draft)         1
2      (draft, redraft)         2