I have a table that shows the results of four colleagues trying to classify several objects as either a, b, c or d. If the colleagues agree on the classification, or if only one colleague is able to classify the object, then I want to show that classification in a new column. If the colleagues disagree, I want to create a separate dataframe that displays those objects. For each object, at most two colleagues are assigned to try to classify it, so there won't be a situation where three colleagues cannot agree on the classification.
It is easy to show an object's classification if only one colleague is able to identify it, but I am struggling when there are two. I can only get as far as the following given my noob Python skills.
The end result I am looking for is 'a' for the first row, 'c' for the third, and 'd' for the fourth. The second row would be singled out for manual classification by a more experienced colleague.
import pandas as pd

df_test = pd.DataFrame({'check1': ['a', 'a', 'unknown', 'd'],
                        'check2': ['unknown', 'b', 'unknown', 'unknown'],
                        'check3': ['unknown', 'unknown', 'c', 'd'],
                        'check4': ['unknown', 'unknown', 'c', 'unknown']})

# indicator columns: 1 if that colleague classified the object, 0 otherwise
cols = ['check_ind', 'check1_ind', 'check2_ind', 'check3_ind', 'check4_ind']
for col in cols:
    df_test[col] = 0

checks = [('check1', 'check1_ind'), ('check2', 'check2_ind'), ('check3', 'check3_ind'), ('check4', 'check4_ind')]
rows = df_test.shape[0]
for r in range(rows):
    for c in checks:
        if df_test.iloc[r, df_test.columns.get_loc(c[0])] != 'unknown':
            df_test.iloc[r, df_test.columns.get_loc(c[1])] = 1

# number of colleagues who classified each object
sumcolumn = df_test['check1_ind'] + df_test['check2_ind'] + df_test['check3_ind'] + df_test['check4_ind']
df_test['body_check'] = sumcolumn
CodePudding user response:
Something like this should do the trick:
import numpy as np

def function(series):
    # value_counts() ignores NaN, so only actual classifications are counted
    val_counts = series.value_counts()
    if val_counts.size > 1:
        return 'No Consensus'
    else:
        return val_counts.index[0]

df_test.replace({'unknown': np.nan}).apply(function, axis=1)
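The question also asks for a separate dataframe of the objects the colleagues disagree on. The sketch below is one way that could look, not a definitive implementation. It assumes a fresh df_test containing only the four check columns. The names classify, check_cols, result, agreed and disputed are introduced here for illustration and are not from the question. The extra branch for a row nobody classified is also an assumption, since the sample data has no such row.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({'check1': ['a', 'a', 'unknown', 'd'],
                        'check2': ['unknown', 'b', 'unknown', 'unknown'],
                        'check3': ['unknown', 'unknown', 'c', 'd'],
                        'check4': ['unknown', 'unknown', 'c', 'unknown']})

def classify(series):
    # value_counts() skips NaN, so only real classifications are counted
    val_counts = series.value_counts()
    if val_counts.size > 1:
        return 'No Consensus'
    if val_counts.size == 0:
        return np.nan  # assumption: nobody classified this object
    return val_counts.index[0]

# hypothetical helper list so only the check columns are inspected
check_cols = ['check1', 'check2', 'check3', 'check4']
result = df_test[check_cols].replace({'unknown': np.nan}).apply(classify, axis=1)

df_test['classification'] = result
disputed = df_test[result == 'No Consensus']  # rows needing manual review
agreed = df_test[result != 'No Consensus']    # everything else
```

With the sample data, disputed should hold only the second row (index 1), which is the one to hand to a more experienced colleague.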
CodePudding user response:
df.replace('unknown', np.nan, inplace=True)
df.apply(lambda x: x.dropna().unique()[0] if x.nunique() == 1 else 'No Consensus', axis=1)
Output:
0               a
1    No Consensus
2               c
3               d
dtype: object
In use:
df['consensus'] = df.apply(lambda x: x.dropna().unique()[0] if x.nunique() == 1 else np.nan, axis=1)
print(df)
...
  check1 check2 check3 check4 consensus
0      a    NaN    NaN    NaN         a
1      a      b    NaN    NaN       NaN
2    NaN    NaN      c      c         c
3      d    NaN      d    NaN         d
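If a row-wise apply is not wanted, the same idea can probably be written with column-wise operations, and the disputed objects can be pulled into their own dataframe at the same time. This is only a sketch under assumptions: check_cols, n_labels, first_label and disputed are names made up here, and the data is rebuilt from the question so the block runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'check1': ['a', 'a', 'unknown', 'd'],
                   'check2': ['unknown', 'b', 'unknown', 'unknown'],
                   'check3': ['unknown', 'unknown', 'c', 'd'],
                   'check4': ['unknown', 'unknown', 'c', 'unknown']})
df = df.replace('unknown', np.nan)

# hypothetical helper list so later columns are not counted as checks
check_cols = ['check1', 'check2', 'check3', 'check4']

# number of distinct non-NaN labels in each row
n_labels = df[check_cols].nunique(axis=1)

# back-fill along each row so the first column holds the first non-NaN label
first_label = df[check_cols].bfill(axis=1).iloc[:, 0]

# keep the label only where exactly one distinct label was given
df['consensus'] = first_label.where(n_labels == 1)

# objects with conflicting labels, for manual classification
disputed = df[n_labels > 1]
```

Filtering on n_labels > 1 rather than on a NaN consensus keeps genuine disagreements apart from objects that nobody classified, which would also end up with NaN in the consensus column.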