I have trouble creating a summary statistic for my data. My dataframe looks like this
id status
a approved
a approved
b draft
b redraft
c redraft
c draft
d approved
d draft
outcome:
status_combo id_count
approved,approved 1
draft,redraft 2
approved,draft 1
The code I'm using
df1=df.groupby('id')['status'].apply(tuple).rename('status_combo')
df2=df1.groupby(df1).size().reset_index(name='id_count')
print(df2)
create all combination of status where the order of status is included too however for me the desired outcome should treat draft,redraft and redraft,draft as one type of status_combo Please advise. Thanks
CodePudding user response:
You can try sort
the column before making it into tuple
df1 = df.groupby('id')['status'].apply(lambda x: tuple(sorted(x))).rename('status_combo')
df2 = df1.groupby(df1).size().reset_index(name='id_count')
print(df2)
status_combo id_count
0 (approved, approved) 1
1 (approved, draft) 1
2 (draft, redraft) 2
CodePudding user response:
You can do :
df = df.groupby('id',as_index=False).agg(
status_approved=('status',lambda x:','.join(sorted(tuple(x))))).groupby(
'status_approved', as_index=False).agg(id_count=('id', 'count'))
print(df):
status_approved id_count
0 approved,approved 1
1 approved,draft 1
2 draft,redraft 2
CodePudding user response:
Can be done with a simple oneliner:
df.groupby('id').agg(set).reset_index().status.value_counts()
Result:
{redraft, draft} 2
{approved} 1
{draft, approved} 1
Name: status, dtype: int64
Or in your solution add sort_values('status')
:
df1=df.sort_values('status').groupby('id')['status'].apply(tuple).rename('status_combo')
df2=df1.groupby(df1).size().reset_index(name='id_count')
print(df2)
Result:
status_combo id_count
0 (approved, approved) 1
1 (approved, draft) 1
2 (draft, redraft) 2