In the DataFrame below, the column 'group' contains three groups: 'A', 'AB' and 'C'. Each of the other columns is assigned to a specific group by its suffix: var1_A relates to group A, and so forth.
import pandas as pd

data = pd.DataFrame({'group': ['A', 'AB', 'A', 'AB', 'AB', 'C', 'C', 'A', 'A', 'AB'],
                     'var1_A': ['pass', 'fail', 'pass', 'fail', 'pass']*2,
                     'var2_A': ['pass', 'pass', 'pass', 'fail', 'pass']*2,
                     'var1_AB': ['pass', 'pass', 'pass', 'fail', 'pass']*2,
                     'var2_AB': ['pass', 'pass', 'fail', 'fail', 'pass']*2,
                     'var1_C': ['pass', 'pass', 'pass', 'fail', 'pass']*2,
                     'var2_C': ['fail', 'fail', 'fail', 'fail', 'pass']*2
                     })
For each row, I want to count the number of times 'pass' occurs. For rows that belong to group A, I only want to count the variables connected to group A, and likewise for the other groups. I want the result in a new column. This almost does the job:
data['new_col'] = data[data['group']=='A'][['var1_A', 'var2_A']].isin(['pass']).sum(1)
data['new_col'] = data[data['group']=='AB'][['var1_AB', 'var2_AB']].isin(['pass']).sum(1)
data['new_col'] = data[data['group']=='C'][['var1_C', 'var2_C']].isin(['pass']).sum(1)
However, I want the results for all groups in the same column. Perhaps this is possible with a groupby and transform, but I got stuck figuring it out.
Target dataframe:
pd.DataFrame({'group':['A', 'AB', 'A', 'AB', 'AB', 'C', 'C', 'A', 'A', 'AB'],
'var1_A':['pass', 'fail', 'pass','fail', 'pass']*2,
'var2_A':['pass', 'pass', 'pass','fail', 'pass']*2,
'var1_AB':['pass', 'pass', 'pass','fail', 'pass']*2,
'var2_AB':['pass', 'pass', 'fail','fail', 'pass']*2,
'var1_C':['pass', 'pass', 'pass','fail', 'pass']*2,
'var2_C': ['fail', 'fail', 'fail','fail', 'pass']*2,
'result':[2,2,2,0,2,1,1,2,0,2]
})
CodePudding user response:
You can melt, filter and groupby.count:
data['result'] = (data
    .rename(columns=lambda x: x.split('_')[-1])  # get only part after "_"
    .reset_index().melt(['index', 'group'])
    # keep only identical groups and "pass" values
    .loc[lambda d: d['group'].eq(d['variable']) & d['value'].eq('pass')]
    .groupby('index')['value'].count()
    .reindex(data.index, fill_value=0)
)
print(data)
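To see what each step contributes, the same idea can be unrolled into separate statements. This is only a sketch and differs slightly from the chain above: it melts first and derives the suffix afterwards, so no duplicate column labels are created. The names `long`, `suffix`, `keep` and `counts` are illustrative, and it assumes the group name is always the part after the last "_".

```python
import pandas as pd

data = pd.DataFrame({
    'group': ['A', 'AB', 'A', 'AB', 'AB', 'C', 'C', 'A', 'A', 'AB'],
    'var1_A': ['pass', 'fail', 'pass', 'fail', 'pass'] * 2,
    'var2_A': ['pass', 'pass', 'pass', 'fail', 'pass'] * 2,
    'var1_AB': ['pass', 'pass', 'pass', 'fail', 'pass'] * 2,
    'var2_AB': ['pass', 'pass', 'fail', 'fail', 'pass'] * 2,
    'var1_C': ['pass', 'pass', 'pass', 'fail', 'pass'] * 2,
    'var2_C': ['fail', 'fail', 'fail', 'fail', 'pass'] * 2,
})

# Long format: one row per (original row, variable) pair
long = data.reset_index().melt(id_vars=['index', 'group'])

# The part after the last "_" names the group a variable belongs to (assumption)
long['suffix'] = long['variable'].str.rsplit('_', n=1).str[-1]

# Keep only "pass" values whose variable suffix matches the row's own group
keep = long['group'].eq(long['suffix']) & long['value'].eq('pass')
counts = long[keep].groupby('index').size()

# Rows without any matching "pass" are missing from counts, so fill them with 0
data['result'] = counts.reindex(data.index, fill_value=0)
```

The reindex step matters because rows such as index 3 have no matching 'pass' and would otherwise end up as NaN.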
Or another approach using matrices and string comparisons:
df2 = data.set_index('group').eq('pass')
data['result'] = (df2.mul(df2.columns.str.extract(r'_(\w+)', expand=False))
                     .eq(df2.index, axis=0).sum(axis=1)
                     .to_numpy()
                  )
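A related way to express the matrix idea is with two boolean arrays and NumPy broadcasting instead of string multiplication. This is a sketch rather than the approach above: `value_cols`, `suffixes`, `belongs` and `passed` are illustrative names, and it assumes every non-'group' column ends in "_<group>".

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'group': ['A', 'AB', 'A', 'AB', 'AB', 'C', 'C', 'A', 'A', 'AB'],
    'var1_A': ['pass', 'fail', 'pass', 'fail', 'pass'] * 2,
    'var2_A': ['pass', 'pass', 'pass', 'fail', 'pass'] * 2,
    'var1_AB': ['pass', 'pass', 'pass', 'fail', 'pass'] * 2,
    'var2_AB': ['pass', 'pass', 'fail', 'fail', 'pass'] * 2,
    'var1_C': ['pass', 'pass', 'pass', 'fail', 'pass'] * 2,
    'var2_C': ['fail', 'fail', 'fail', 'fail', 'pass'] * 2,
})

value_cols = [c for c in data.columns if c != 'group']

# Group suffix of each value column, e.g. 'var1_AB' -> 'AB'
suffixes = np.array([c.rsplit('_', 1)[-1] for c in value_cols], dtype=object)
groups = data['group'].to_numpy(dtype=object)

# (n_rows, n_cols) mask: True where the column belongs to the row's group
belongs = groups[:, None] == suffixes[None, :]

# (n_rows, n_cols) mask: True where the cell holds 'pass'
passed = data[value_cols].eq('pass').to_numpy()

# Count cells that are both in the row's group and equal to 'pass'
data['result'] = (belongs & passed).sum(axis=1)
```

Both masks have one row per record and one column per variable, so the element-wise `&` followed by a row sum gives the per-row count directly.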
Output:
  group var1_A var2_A var1_AB var2_AB var1_C var2_C  result
0     A   pass   pass    pass    pass   pass   fail       2
1    AB   fail   pass    pass    pass   pass   fail       2
2     A   pass   pass    pass    fail   pass   fail       2
3    AB   fail   fail    fail    fail   fail   fail       0
4    AB   pass   pass    pass    pass   pass   pass       2
5     C   pass   pass    pass    pass   pass   fail       1
6     C   fail   pass    pass    pass   pass   fail       1
7     A   pass   pass    pass    fail   pass   fail       2
8     A   fail   fail    fail    fail   fail   fail       0
9    AB   pass   pass    pass    pass   pass   pass       2
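For comparison, the per-group attempt from the question can also fill a single column if each group's counts are written only to that group's rows through a boolean mask and `.loc`. This is a minimal sketch, assuming the columns for a group are exactly those ending in "_<group>"; the loop variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'group': ['A', 'AB', 'A', 'AB', 'AB', 'C', 'C', 'A', 'A', 'AB'],
    'var1_A': ['pass', 'fail', 'pass', 'fail', 'pass'] * 2,
    'var2_A': ['pass', 'pass', 'pass', 'fail', 'pass'] * 2,
    'var1_AB': ['pass', 'pass', 'pass', 'fail', 'pass'] * 2,
    'var2_AB': ['pass', 'pass', 'fail', 'fail', 'pass'] * 2,
    'var1_C': ['pass', 'pass', 'pass', 'fail', 'pass'] * 2,
    'var2_C': ['fail', 'fail', 'fail', 'fail', 'pass'] * 2,
})

# Start from 0 so every row has a value before the groups are filled in
data['result'] = 0

for grp in data['group'].unique():
    # Columns tied to this group; '_A' does not match 'var1_AB' because
    # endswith compares the whole suffix including the underscore
    cols = [c for c in data.columns if c.endswith(f'_{grp}')]
    rows = data['group'].eq(grp)
    # Write this group's counts only into its own rows
    data.loc[rows, 'result'] = data.loc[rows, cols].eq('pass').sum(axis=1)
```

This loops once per group rather than once per row, so it stays reasonably cheap when there are few groups, though the vectorized versions above avoid the Python loop entirely.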