I have data with the columns qid, docid, good, fair and bad.
Good, fair and bad are grades from different graders,
and I need to find the qid values for which only bad grades exist.
qid  docid  good  fair  bad
2    3      2     3     0
2    5      1     0     4
3    2      0     0     4
3    4      0     0     3
In this case I want to find qid = 3,
since every row with this qid has only "bad" results. I am unable to find an efficient way to do this operation.
CodePudding user response:
Use a boolean mask:
>>> df[df[['good', 'fair']].eq(0).all(axis=1) & df['bad'].gt(0)]
   qid  docid  good  fair  bad
2    3      2     0     0    4
3    3      4     0     0    3
>>> df.loc[df[['good', 'fair']].eq(0).all(axis=1) & df['bad'].gt(0), 'qid'] \
...     .unique().tolist()
[3]
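Note that the mask above is evaluated row by row, so a qid with one bad-only row and one mixed row would still be reported. If every row of a qid has to be bad-only, one option is to aggregate the row mask per qid. The following is a minimal sketch, assuming the sample data from the question; the names `row_is_bad_only`, `all_bad` and `bad_only_qids` are made up for illustration:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "qid":   [2, 2, 3, 3],
    "docid": [3, 5, 2, 4],
    "good":  [2, 1, 0, 0],
    "fair":  [3, 0, 0, 0],
    "bad":   [0, 4, 4, 3],
})

# True for rows that have no good/fair grades and at least one bad grade.
row_is_bad_only = df[["good", "fair"]].eq(0).all(axis=1) & df["bad"].gt(0)

# A qid qualifies only if all of its rows are bad-only.
all_bad = row_is_bad_only.groupby(df["qid"]).all()
bad_only_qids = all_bad[all_bad].index.tolist()
print(bad_only_qids)
```

On the sample data this gives the same result as the row-level mask, because both rows of qid 3 are bad-only.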
CodePudding user response:
You can use np.where() as below (checking the grade columns rather than a hard-coded qid):
import numpy as np
df['Check'] = np.where((df.good == 0) & (df.fair == 0) & (df.bad > 0), True, False)
df.loc[df['Check'], ['qid', 'docid', 'good', 'fair', 'bad']]
CodePudding user response:
You can use df.groupby to get a table with the sum of each grade's occurrences per qid:
>>> filtered_df = df.groupby(by=['qid']).sum().reset_index()
>>> filtered_df
   qid  docid  good  fair  bad
0    2      8     3     3    4
1    3      6     0     0    7
Then you can use df.loc to find the rows with only bad grades (note that pandas masks are combined with &, not &&):
>>> filtered_df.loc[(filtered_df["good"] == 0) & (filtered_df["fair"] == 0)]
   qid  docid  good  fair  bad
1    3      6     0     0    7
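The groupby approach can be sketched end to end as follows. This is only an illustration on the sample data from the question: it assumes the grades are non-negative counts (so a zero sum means no such grade in any row), it sums only the grade columns because a summed docid has no meaning, and the names `totals`, `only_bad` and `qids` are made up here:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "qid":   [2, 2, 3, 3],
    "docid": [3, 5, 2, 4],
    "good":  [2, 1, 0, 0],
    "fair":  [3, 0, 0, 0],
    "bad":   [0, 4, 4, 3],
})

# Sum each grade column per qid; qid becomes the index.
totals = df.groupby("qid")[["good", "fair", "bad"]].sum()

# Keep qids with no good/fair grades at all and at least one bad grade.
only_bad = totals[(totals["good"] == 0) & (totals["fair"] == 0) & (totals["bad"] > 0)]
qids = only_bad.index.tolist()
print(qids)
```

The extra `bad > 0` condition excludes a qid whose rows carry no grades at all, which the two-column check alone would let through.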