difference between pandas df.query and using dataframe directly

Time:11-17

I have a question about two statements that I believed did the same job, but surprisingly I got different results!

(df.query('has_cancer == False')['test_result'] == 'Negative').mean()

> 0.79639570552147243

And

(df[df['has_cancer'] == False].test_result == 'Negative').sum() / df.shape[0]

> 0.71276595744680848

Why am I getting different results? And it's not a small difference!

Answer:

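Both statements filter the dataframe to the rows where has_cancer is False, but they divide by different denominators. In your first statement, .mean() divides by the number of rows left after filtering.

In your second statement, you are dividing by df.shape[0], which is the row count of the unfiltered dataframe.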

df['has_cancer'].value_counts() shows:

False    2608
True      306
Name: has_cancer, dtype: int64

and,

(df[df['has_cancer'] == False].test_result == 'Negative').sum()

2077

So, 2077/2608 = 0.7963957055214724
and, 2077/(2608 + 306) = 0.7127659574468085
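
To check this yourself, here is a minimal sketch. It assumes a synthetic dataframe rebuilt from the counts quoted in this answer, not your real data. The test_result values for the has_cancer == True rows are placeholders I made up, and they do not affect either ratio. The variable names (no_cancer, is_negative, and so on) are only for illustration.

```python
import pandas as pd

# Synthetic frame built from the counts in this answer (an assumption, not the real data):
# 2608 rows with has_cancer == False, of which 2077 are 'Negative', plus 306 True rows.
# The test_result values for the True rows are placeholders.
df = pd.DataFrame({
    "has_cancer": [False] * 2608 + [True] * 306,
    "test_result": ["Negative"] * 2077 + ["Positive"] * 531 + ["Positive"] * 306,
})

# Boolean indexing selects the same rows as df.query('has_cancer == False').
no_cancer = df[df["has_cancer"] == False]
is_negative = no_cancer["test_result"] == "Negative"

rate_mean = is_negative.mean()                           # denominator: filtered row count
rate_filtered = is_negative.sum() / no_cancer.shape[0]   # same denominator, same result
rate_unfiltered = is_negative.sum() / df.shape[0]        # denominator: all rows

print(rate_mean, rate_filtered, rate_unfiltered)
```

So df.query and boolean indexing are not the source of the difference. If you divide the sum by the filtered row count (or just call .mean() on the filtered comparison), both approaches agree.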
