How to count the occurrences of elements in a list for each row in pandas

I have a DataFrame that looks like this. It has a MultiIndex and is the result of a groupby:

grouped = df.groupby(['chromosome', 'start_pos', 'end_pos',
                      'observed']).agg(lambda x: x.tolist())

                                          reference         zygosity    
chromosome  start_pos   end_pos observed                                            
chr1            69428   69428       G       [T, T]          [hom, hom]      
                69511   69511       G       [A, A]          [hom, hom]      
                762273  762273      A       [G, G, G]       [hom, het, hom] 
                762589  762589      C       [G]             [hom]       
                762592  762592      G       [C]             [het]

For each row, I want to count the number of het and hom values in the zygosity list and store them in two new columns called 'count_hom' and 'count_het'.

I have tried using a for loop, but it is slow and not very reliable when the data changes. Is there a way to do this using something like df.zygosity.str.len().sum(), but only for het or only for hom?

CodePudding user response：

Instead of working on the groupby result, you could adjust the groupby itself: pass lambdas to agg that count the "het" and "hom" values for each group at the time you build grouped:

grouped = (df.groupby(['chromosome', 'start_pos', 'end_pos','observed'])
           .agg(reference=('reference', list), 
                zygosity=('zygosity', list), 
                count_het=('zygosity', lambda x: x.eq('het').sum()),
                count_hom=('zygosity', lambda x: x.eq('hom').sum())))
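
To try this end to end, here is a minimal sketch. The long-format input frame is an assumption, reconstructed from the table in the question; the real df may hold more columns.

```python
import pandas as pd

# Hypothetical long-format input, reconstructed from the question's table.
positions = [69428, 69428, 69511, 69511, 762273, 762273, 762273, 762589, 762592]
df = pd.DataFrame({
    "chromosome": ["chr1"] * 9,
    "start_pos": positions,
    "end_pos": positions,
    "observed": ["G", "G", "G", "G", "A", "A", "A", "C", "G"],
    "reference": ["T", "T", "A", "A", "G", "G", "G", "G", "C"],
    "zygosity": ["hom", "hom", "hom", "hom", "hom", "het", "hom", "hom", "het"],
})

keys = ["chromosome", "start_pos", "end_pos", "observed"]

# Named aggregation: each keyword becomes an output column,
# built from (source column, aggregation function).
grouped = df.groupby(keys).agg(
    reference=("reference", list),
    zygosity=("zygosity", list),
    count_het=("zygosity", lambda s: s.eq("het").sum()),
    count_hom=("zygosity", lambda s: s.eq("hom").sum()),
)

print(grouped)
```

Because the counts are computed on the original string column rather than on the aggregated lists, no per-row Python list scanning is needed afterwards.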

CodePudding user response：

Use Series.apply with list.count:

grouped['count_hom'] = grouped['zygosity'].apply(lambda x: x.count('hom'))
grouped['count_het'] = grouped['zygosity'].apply(lambda x: x.count('het'))
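
If more labels than 'hom' and 'het' may appear, the same idea can be written as a loop. This is only a sketch: the grouped frame below is built by hand to mirror the question's table, and the labels tuple is an assumption about which values you care about.

```python
import pandas as pd

# Hand-built stand-in for the grouped frame shown in the question.
index = pd.MultiIndex.from_tuples(
    [
        ("chr1", 69428, 69428, "G"),
        ("chr1", 69511, 69511, "G"),
        ("chr1", 762273, 762273, "A"),
        ("chr1", 762589, 762589, "C"),
        ("chr1", 762592, 762592, "G"),
    ],
    names=["chromosome", "start_pos", "end_pos", "observed"],
)
grouped = pd.DataFrame(
    {
        "reference": [["T", "T"], ["A", "A"], ["G", "G", "G"], ["G"], ["C"]],
        "zygosity": [["hom", "hom"], ["hom", "hom"], ["hom", "het", "hom"], ["hom"], ["het"]],
    },
    index=index,
)

# Assumed set of labels to count; extend as needed.
labels = ("hom", "het")

for label in labels:
    # label=label binds the current value so each lambda counts its own label.
    grouped[f"count_{label}"] = grouped["zygosity"].apply(
        lambda calls, label=label: calls.count(label)
    )

print(grouped[["zygosity", "count_hom", "count_het"]])
```

This still runs one Python call per row, so it is simple rather than fast; it mainly avoids repeating one line per label.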

CodePudding user response：

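You can dynamically count all possible values by calling explode on the list column, grouping on the four index levels, then using value_counts and unstack (here the frame of lists is grouped, as in the question):

new_df = pd.concat(
    [grouped,
     grouped['zygosity'].explode()
                        .groupby(level=[0, 1, 2, 3])
                        .value_counts()
                        .unstack(level=4)
                        .fillna(0)
                        .add_prefix('count_')
                        .astype(int)],
    axis=1)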

Output:

>>> new_df
                                       reference         zygosity  count_het  count_hom
chromosome start_pos end_pos observed                                                  
chr1       69428     69428   G            [T, T]       [hom, hom]          0          2
           69511     69511   G            [A, A]       [hom, hom]          0          2
           762273    762273  A         [G, G, G]  [hom, het, hom]          1          2
           762589    762589  C               [G]            [hom]          0          1
           762592    762592  G               [C]            [het]          1          0
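
The chain above can also be broken into named steps, which makes each intermediate result easier to inspect. This is a sketch with two swapped-in details: unstack(fill_value=0) stands in for fillna(0).astype(int), and DataFrame.join stands in for pd.concat. The grouped frame is again built by hand from the question's table.

```python
import pandas as pd

# Hand-built stand-in for the grouped frame shown in the question.
index = pd.MultiIndex.from_tuples(
    [
        ("chr1", 69428, 69428, "G"),
        ("chr1", 69511, 69511, "G"),
        ("chr1", 762273, 762273, "A"),
        ("chr1", 762589, 762589, "C"),
        ("chr1", 762592, 762592, "G"),
    ],
    names=["chromosome", "start_pos", "end_pos", "observed"],
)
grouped = pd.DataFrame(
    {
        "reference": [["T", "T"], ["A", "A"], ["G", "G", "G"], ["G"], ["C"]],
        "zygosity": [["hom", "hom"], ["hom", "hom"], ["hom", "het", "hom"], ["hom"], ["het"]],
    },
    index=index,
)

# Step 1: one row per list element; the MultiIndex entries are repeated.
exploded = grouped["zygosity"].explode()

# Step 2: count each label within each original row (the four index levels).
per_row = exploded.groupby(level=[0, 1, 2, 3]).value_counts()

# Step 3: pivot the labels into columns; missing combinations become 0.
counts = per_row.unstack(fill_value=0).add_prefix("count_")

# Step 4: attach the count columns to the original frame by index.
new_df = grouped.join(counts)

print(new_df)
```

Unlike the list.count approach, this needs no list of labels up front: any value that occurs in zygosity gets its own count_ column.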