I have a DataFrame that looks like this. It is a MultiIndex DataFrame resulting from a groupby:
grouped = df.groupby(['chromosome', 'start_pos', 'end_pos',
                      'observed']).agg(lambda x: x.tolist())
                                       reference         zygosity
chromosome start_pos end_pos observed
chr1       69428     69428   G            [T, T]       [hom, hom]
           69511     69511   G            [A, A]       [hom, hom]
           762273    762273  A         [G, G, G]  [hom, het, hom]
           762589    762589  C               [G]            [hom]
           762592    762592  G               [C]            [het]
For each row, I want to count the number of het and hom entries in zygosity and store them in new columns called 'count_hom' and 'count_het'.
I have tried using a for loop, but it is slow and not very reliable when the data changes. Is there a way to do this using something like df.zygosity.len().sum(), but only for het or only for hom?
CodePudding user response:
Instead of working on the groupby result, you could adjust your groupby construction a bit by passing lambdas to agg that count the "het" and "hom" values for each group at the time you build grouped:
grouped = (df.groupby(['chromosome', 'start_pos', 'end_pos', 'observed'])
             .agg(reference=('reference', list),
                  zygosity=('zygosity', list),
                  count_het=('zygosity', lambda x: x.eq('het').sum()),
                  count_hom=('zygosity', lambda x: x.eq('hom').sum())))
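To try this end to end, here is a minimal sketch. The flat input rows are an assumption, rebuilt from the lists shown in the question (one row per list element); the real df may well hold more columns.

```python
import pandas as pd

# Assumed flat input: one row per call, reconstructed from the question's table.
rows = [
    ("chr1", 69428, 69428, "G", "T", "hom"),
    ("chr1", 69428, 69428, "G", "T", "hom"),
    ("chr1", 69511, 69511, "G", "A", "hom"),
    ("chr1", 69511, 69511, "G", "A", "hom"),
    ("chr1", 762273, 762273, "A", "G", "hom"),
    ("chr1", 762273, 762273, "A", "G", "het"),
    ("chr1", 762273, 762273, "A", "G", "hom"),
    ("chr1", 762589, 762589, "C", "G", "hom"),
    ("chr1", 762592, 762592, "G", "C", "het"),
]
keys = ["chromosome", "start_pos", "end_pos", "observed"]
df = pd.DataFrame(rows, columns=keys + ["reference", "zygosity"])

# Named aggregation: build the lists and the per-group counts in one pass.
grouped = df.groupby(keys).agg(
    reference=("reference", list),
    zygosity=("zygosity", list),
    count_het=("zygosity", lambda s: s.eq("het").sum()),
    count_hom=("zygosity", lambda s: s.eq("hom").sum()),
)
print(grouped)
```

With this input, count_het comes out as 0, 0, 1, 0, 1 and count_hom as 2, 2, 2, 1, 0, which matches the lists in the question.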
CodePudding user response:
Use Series.apply with list.count:
grouped['count_hom'] = grouped['zygosity'].apply(lambda x: x.count('hom'))
grouped['count_het'] = grouped['zygosity'].apply(lambda x: x.count('het'))
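If more labels than these two might show up, the same idea can be written as a loop. This is only a sketch: the two-row grouped frame below is a stand-in I made up from the question's table, and the labels tuple is an assumption you would adapt.

```python
import pandas as pd

# Assumed stand-in for the grouped frame from the question (two rows only).
index = pd.MultiIndex.from_tuples(
    [("chr1", 762273, 762273, "A"), ("chr1", 762592, 762592, "G")],
    names=["chromosome", "start_pos", "end_pos", "observed"],
)
grouped = pd.DataFrame(
    {
        "reference": [["G", "G", "G"], ["C"]],
        "zygosity": [["hom", "het", "hom"], ["het"]],
    },
    index=index,
)

# list.count runs on the plain Python list stored in each cell.
# label=label binds the current label to the lambda on each iteration.
for label in ("hom", "het"):
    grouped["count_" + label] = grouped["zygosity"].apply(
        lambda calls, label=label: calls.count(label)
    )
print(grouped)
```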
CodePudding user response:
You can dynamically count all possible values using explode, then groupby, then value_counts, then unstack:
counts = grouped['zygosity'].explode().groupby(level=[0,1,2,3]).value_counts().unstack(level=4).fillna(0).add_prefix('count_').astype(int)
new_df = pd.concat([grouped, counts], axis=1)  # grouped is the MultiIndex frame from the question
Output:
>>> new_df
                                       reference         zygosity  count_het  count_hom
chromosome start_pos end_pos observed
chr1       69428     69428   G            [T, T]       [hom, hom]          0          2
           69511     69511   G            [A, A]       [hom, hom]          0          2
           762273    762273  A         [G, G, G]  [hom, het, hom]          1          2
           762589    762589  C               [G]            [hom]          0          1
           762592    762592  G               [C]            [het]          1          0
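For anyone who wants to run the explode approach step by step, here is a self-contained sketch. The three-row grouped frame is an assumed stand-in built from the question's table, not the asker's real data.

```python
import pandas as pd

# Assumed stand-in for the grouped frame from the question (three rows).
index = pd.MultiIndex.from_tuples(
    [
        ("chr1", 69428, 69428, "G"),
        ("chr1", 762273, 762273, "A"),
        ("chr1", 762592, 762592, "G"),
    ],
    names=["chromosome", "start_pos", "end_pos", "observed"],
)
grouped = pd.DataFrame(
    {
        "reference": [["T", "T"], ["G", "G", "G"], ["C"]],
        "zygosity": [["hom", "hom"], ["hom", "het", "hom"], ["het"]],
    },
    index=index,
)

counts = (
    grouped["zygosity"]
    .explode()                    # one row per list element, index repeated
    .groupby(level=[0, 1, 2, 3])  # regroup on the four original index levels
    .value_counts()               # count of each label within each group
    .unstack(level=4)             # labels become columns
    .fillna(0)                    # a label absent from a group counts as 0
    .add_prefix("count_")
    .astype(int)
)
new_df = pd.concat([grouped, counts], axis=1)
print(new_df)
```

Because the columns come from whatever labels appear in the data, a third label would get its own count_ column without any code change.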