Pandas - Build sequnce in the group while resetting and create new dataframe with summary-CodePudding

I got some help in the past and was able to advance well. Now I have an additional need to create a summary dataset for the study. Please help me if you can.

This is my current Dataset: import pandas as pd

df2 = pd.DataFrame({'patient': ['one', 'one', 'one', 'three','three', 'two','two','two','two'],    
                    'pattern': ['A', 'B', '000', 'C', 'A', '000','D','A','C'],        
                    'date': ['11/20/2022', '11/22/2022', '11/23/2022', '11/8/2022', '11/9/2022', '11/14/2022','11/20/2022', '11/22/2022', '11/23/2022']})


m = df2['pattern'] == '000'
df2['result'] = (df2[~m].groupby(['patient', m.cumsum()])
                        .cumcount().add(1)
                        .reindex(df2.index, fill_value=0))

df2

From the above current Dataset, I like to create a summary dataset like shown below. Can you help me how extract the below summary dataset from the above dataset, please?

required_dataset = pd.DataFrame({'pattern': ['A,B', 'C,A','D,A,C'],   ### Pattern happend by Date
                    'patients': [1,1,1]})                             ### Total Number of unique patients

required_dataset

CodePudding user response：

IIUC, you can use GroupBy.agg :

out = (
        df2[~m].groupby("patient")
                    .agg(pattern= ("pattern", ",".join),
                         patients=("patient", "nunique"))
                    .reset_index(drop=True)
      )

Output :

print(out)

  pattern  patients
0     A,B         1
1     C,A         1
2   D,A,C         1

CodePudding user response：

You can try doing this:

m = df2["pattern"] == "000"

summary_df = df2[~m].groupby("patient").agg(pattern = ("pattern", lambda x: ','.join(x)),
                                            patients = ("patient", "count")).reset_index()

summary_df

Prints:

    patient pattern   patients
0   one     A,B       2
1   three   C,A       2
2   two     D,A,C     3

The first pattern column concatenates all the patterns with ' ,'.join(x). The second patient column counts the number of visits.

Granted, the required_dataframe you provided had different values in this one, but I guess you meant visits. If not, let me know and I'll update my answer.