Data problem: identifying data rows where colleagues have reached a consensus

Time:05-15

I have a table that shows the results of four colleagues trying to classify several objects as a, b, c or d. If the colleagues agree on the classification, or if only one colleague was able to classify the object, I want to show that classification in a new column. If the colleagues disagree, I want to create a separate dataframe that displays those objects. At most two colleagues are assigned to classify each object, so there won't be a situation where three colleagues cannot agree on the classification.

It is easy to show an object's classification if only one colleague is able to identify it, but I am struggling when there are two. I can only get as far as the following given my noob python skills.

The end result I am looking for is 'a' for the first row, 'c' for the third, and 'd' for the fourth. The second row would be singled out for manual classification by a more experienced colleague.

import numpy as np
import pandas as pd

df_test = pd.DataFrame({'check1':['a','a','unknown','d'],
                        'check2':['unknown','b','unknown','unknown'],
                        'check3':['unknown','unknown','c','d'],
                        'check4':['unknown','unknown','c','unknown']})

cols = ['check_ind','check1_ind','check2_ind','check3_ind','check4_ind']
for col in cols:        
    df_test[col] = 0
checks = [('check1','check1_ind'),('check2','check2_ind'),('check3','check3_ind'),('check4','check4_ind')]
rows = df_test.shape[0]
for r in range(rows):
    for c in checks:
        if df_test.iloc[r, df_test.columns.get_loc(c[0])] != 'unknown':
            df_test.iloc[r, df_test.columns.get_loc(c[1])] = 1
sumcolumn = df_test['check1_ind'] + df_test['check2_ind'] + df_test['check3_ind'] + df_test['check4_ind']
df_test['body_check'] = sumcolumn
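
For reference, the per-row count built by the loops above can probably be produced in one vectorized step. This is only a sketch: `check_cols` and `classified` are illustrative names that do not appear in the original post, and it skips the intermediate `*_ind` columns.

```python
import pandas as pd

df_test = pd.DataFrame({'check1': ['a', 'a', 'unknown', 'd'],
                        'check2': ['unknown', 'b', 'unknown', 'unknown'],
                        'check3': ['unknown', 'unknown', 'c', 'd'],
                        'check4': ['unknown', 'unknown', 'c', 'unknown']})

# Illustrative name: the four columns holding the colleagues' answers.
check_cols = ['check1', 'check2', 'check3', 'check4']

# True where a colleague gave a classification, False where it is 'unknown'.
classified = df_test[check_cols].ne('unknown')

# Number of colleagues who classified each object.
df_test['body_check'] = classified.sum(axis=1)
print(df_test)
```

Note that the count alone cannot separate agreement from disagreement: the second and third rows both have two classifications, but only the third agrees.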

CodePudding user response:

Something like this should do the trick:

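def consensus(row):
    # value_counts() ignores NaN, so only real classifications are counted
    val_counts = row.value_counts()
    if val_counts.size > 1:
        return 'No Consensus'
    elif val_counts.size == 1:
        return val_counts.index[0]
    return np.nan  # nobody classified this object


# Needs numpy imported as np. Only the four check columns are used, so any
# helper columns added to df_test earlier do not affect the result.
check_cols = ['check1', 'check2', 'check3', 'check4']
df_test[check_cols].replace({'unknown': np.nan}).apply(consensus, axis=1)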

CodePudding user response:

# df holds only the four check columns from the question
df = df_test[['check1', 'check2', 'check3', 'check4']].replace('unknown', np.nan)
df.apply(lambda x: x.dropna().unique()[0] if x.nunique() == 1 else 'No Consensus', axis=1)

Output:

0               a
1    No Consensus
2               c
3               d
dtype: object

In use:

df['consensus'] = df.apply(lambda x: x.dropna().unique()[0] if x.nunique() == 1 else np.nan, axis=1)
print(df)

...

  check1 check2 check3 check4 consensus
0      a    NaN    NaN    NaN         a
1      a      b    NaN    NaN       NaN
2    NaN    NaN      c      c         c
3      d    NaN      d    NaN         d
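
The question also asks for the disputed objects in a separate dataframe, which neither response shows. A minimal sketch, assuming the same data and the same `consensus` rule as above; the names `check_cols`, `checks`, `n_labels`, `resolved` and `disputed` are illustrative and not from the original posts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'check1': ['a', 'a', 'unknown', 'd'],
                   'check2': ['unknown', 'b', 'unknown', 'unknown'],
                   'check3': ['unknown', 'unknown', 'c', 'd'],
                   'check4': ['unknown', 'unknown', 'c', 'unknown']})

check_cols = ['check1', 'check2', 'check3', 'check4']  # illustrative name
checks = df[check_cols].replace('unknown', np.nan)

# Distinct non-missing labels per row; nunique() ignores NaN by default.
n_labels = checks.nunique(axis=1)

# One distinct label means agreement (or a single classifier); otherwise NaN.
df['consensus'] = checks.apply(
    lambda row: row.dropna().iloc[0] if row.nunique() == 1 else np.nan,
    axis=1)

resolved = df[n_labels == 1]   # objects with an agreed classification
disputed = df[n_labels > 1]    # objects needing manual classification
print(disputed)
```

With the sample data, `disputed` should contain only the second row (index 1). An object that nobody classified has zero distinct labels and would land in neither frame, so it may need separate handling.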