I have a question about two statements that I believe compute the same thing, but surprisingly I got different results!
(df.query('has_cancer == False')['test_result'] == 'Negative').mean()
> 0.79639570552147243
And
(df[df['has_cancer'] == False].test_result == 'Negative').sum() / df.shape[0]
> 0.71276595744680848
Why am I getting different results? And it's not a small difference!
CodePudding user response:
The difference is the denominator. In your first statement, the dataframe is filtered to has_cancer equal to False, so .mean() divides by the number of filtered rows.
In your second statement, you divide by df.shape[0], which is the row count of the unfiltered dataframe.
df['has_cancer'].value_counts() shows:
False 2608
True 306
Name: has_cancer, dtype: int64
and,
(df[df['has_cancer'] == False].test_result == 'Negative').sum()
2077
So, 2077/2608 = 0.7963957055214724
and, 2077/(2608 + 306) = 0.7127659574468085
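To see the same effect at a smaller scale, here is a minimal sketch on made-up data. The five rows below are hypothetical and are not taken from your dataset; only the column names has_cancer and test_result match your code. It should show that .mean() on the filtered frame agrees with dividing by the filtered row count, not by the full row count.

```python
import pandas as pd

# Hypothetical toy data (NOT the original dataset): 4 healthy rows, 1 cancer row
df = pd.DataFrame({
    "has_cancer": [False, False, False, False, True],
    "test_result": ["Negative", "Negative", "Negative", "Positive", "Positive"],
})

# Keep only the rows without cancer
healthy = df[df["has_cancer"] == False]

# Count the negative test results among those rows
negatives = (healthy["test_result"] == "Negative").sum()

# Denominator = filtered row count (what .mean() uses implicitly)
rate_among_healthy = negatives / len(healthy)

# Denominator = all rows (what dividing by df.shape[0] does)
rate_among_all = negatives / df.shape[0]

# The first statement from the question, applied to the toy data
mean_version = (df.query("has_cancer == False")["test_result"] == "Negative").mean()

print(rate_among_healthy, rate_among_all, mean_version)
```

If you want the second statement to match the first, divide by the length of the filtered dataframe instead of df.shape[0].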