Count occurrence within a group using different columns for each group


In the df below there are three groups in the variable 'group': 'A', 'AB' and 'C'. The other columns in the df are assigned to a specific group by their suffix: var1_A relates to group A, and so forth.

import pandas as pd

data = pd.DataFrame({'group':['A', 'AB', 'A', 'AB', 'AB', 'C', 'C', 'A', 'A', 'AB'],
                     'var1_A':['pass', 'fail', 'pass','fail', 'pass']*2,
                     'var2_A':['pass', 'pass', 'pass','fail', 'pass']*2,
                     'var1_AB':['pass', 'pass', 'pass','fail', 'pass']*2,
                     'var2_AB':['pass', 'pass', 'fail','fail', 'pass']*2,
                     'var1_C':['pass', 'pass', 'pass','fail', 'pass']*2,
                     'var2_C': ['fail', 'fail', 'fail','fail', 'pass']*2
                    })
            

For each row, I want to count the number of times 'pass' occurs. For the rows that belong to group A, I only want to count the variables that are connected to group A. I want the result in a new column. The following would almost do the job:

data['new_col'] = data[data['group']=='A'][['var1_A', 'var2_A']].isin(['pass']).sum(1)
data['new_col'] = data[data['group']=='AB'][['var1_AB', 'var2_AB']].isin(['pass']).sum(1)
data['new_col'] = data[data['group']=='C'][['var1_C', 'var2_C']].isin(['pass']).sum(1)

However, each of these assignments overwrites the previous one, and I want the results from all groups in the same column. This is perhaps possible using a groupby and transform? However, I got stuck figuring it out.
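One straightforward way to get all groups into one column (without groupby/transform) is to loop over the group labels and write each group's count into a shared column via .loc. A sketch, rebuilding the question's data; the column name 'result' matches the target frame:

```python
import pandas as pd

data = pd.DataFrame({'group':['A', 'AB', 'A', 'AB', 'AB', 'C', 'C', 'A', 'A', 'AB'],
                     'var1_A':['pass', 'fail', 'pass','fail', 'pass']*2,
                     'var2_A':['pass', 'pass', 'pass','fail', 'pass']*2,
                     'var1_AB':['pass', 'pass', 'pass','fail', 'pass']*2,
                     'var2_AB':['pass', 'pass', 'fail','fail', 'pass']*2,
                     'var1_C':['pass', 'pass', 'pass','fail', 'pass']*2,
                     'var2_C':['fail', 'fail', 'fail','fail', 'pass']*2})

for g in data['group'].unique():
    mask = data['group'] == g
    cols = [f'var1_{g}', f'var2_{g}']  # columns belonging to this group
    # count 'pass' only in this group's columns, only on this group's rows
    data.loc[mask, 'result'] = data.loc[mask, cols].eq('pass').sum(axis=1)

data['result'] = data['result'].astype(int)  # partial .loc assignment upcasts to float
print(data['result'].tolist())  # → [2, 2, 2, 0, 2, 1, 1, 2, 0, 2]
```

This is less elegant than a vectorized solution, but it makes the per-group logic explicit.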

Target dataframe:

pd.DataFrame({'group':['A', 'AB', 'A', 'AB', 'AB', 'C', 'C', 'A', 'A', 'AB'],
                     'var1_A':['pass', 'fail', 'pass','fail', 'pass']*2,
                     'var2_A':['pass', 'pass', 'pass','fail', 'pass']*2,
                     'var1_AB':['pass', 'pass', 'pass','fail', 'pass']*2,
                     'var2_AB':['pass', 'pass', 'fail','fail', 'pass']*2,
                     'var1_C':['pass', 'pass', 'pass','fail', 'pass']*2,
                     'var2_C': ['fail', 'fail', 'fail','fail', 'pass']*2,
                     'result':[2,2,2,0,2,1,1,2,0,2]
                    })

CodePudding user response:

You can melt, filter and groupby.count:

data['result'] = (data
  .rename(columns=lambda x: x.split('_')[-1]) # get only part after "_"
  .reset_index().melt(['index', 'group'])
  # keep only identical groups and "pass" values
  .loc[lambda d: d['group'].eq(d['variable']) & d['value'].eq('pass')]
  .groupby('index')['value'].count()
  .reindex(data.index, fill_value=0)
)

print(data)

Or another approach using matrices and string comparisons:

df2 = data.set_index('group').eq('pass')
data['result'] = (df2.mul(df2.columns.str.extract(r'_(\w+)', expand=False))
                     .eq(df2.index, axis=0).sum(axis=1)
                     .to_numpy()
                 )
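The multiplication step relies on Python's bool-times-str semantics: True * 'A' gives 'A' and False * 'A' gives ''. A tiny illustrative frame (the column name and values here are made up for the demo):

```python
import pandas as pd

# Toy frame: one boolean column whose suffix names group 'A'
df = pd.DataFrame({'x_A': [True, False]},
                  index=pd.Index(['A', 'A'], name='group'))

suffix = df.columns.str.extract(r'_(\w+)', expand=False)  # Index(['A'])
labeled = df.mul(suffix)  # True -> 'A', False -> ''
print(labeled['x_A'].tolist())                        # → ['A', '']
# Comparing against the index keeps only passes in the row's own group
print(labeled.eq(df.index, axis=0)['x_A'].tolist())   # → [True, False]
```

So after the multiplication, a cell holds the group suffix only where the value was 'pass', and the comparison with the index keeps just the cells whose suffix matches the row's group; summing those booleans per row gives the count.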

Output:

  group var1_A var2_A var1_AB var2_AB var1_C var2_C  result
0     A   pass   pass    pass    pass   pass   fail       2
1    AB   fail   pass    pass    pass   pass   fail       2
2     A   pass   pass    pass    fail   pass   fail       2
3    AB   fail   fail    fail    fail   fail   fail       0
4    AB   pass   pass    pass    pass   pass   pass       2
5     C   pass   pass    pass    pass   pass   fail       1
6     C   fail   pass    pass    pass   pass   fail       1
7     A   pass   pass    pass    fail   pass   fail       2
8     A   fail   fail    fail    fail   fail   fail       0
9    AB   pass   pass    pass    pass   pass   pass       2