Home > Net >  Subsetting Pandas dataframe based on Boolean condition - why doesn't order matter?
Subsetting Pandas dataframe based on Boolean condition - why doesn't order matter?

Time:05-26

My question is on Python pandas. Suppose we have pandas dataframe df with two columns "A" and "B". I subset to select all values of column "B" where the entries of column "A" are equal to "value". Why does the order of the following commands not matter?

df[df["A"]=="value"]["B"]

Returns the same as the code

df["B"][df["A"]=="value"]

Can you please explain why these work the same? Thanks.

CodePudding user response:

The df["A"]=="value" part of your code returns a pandas Series containing Boolean values in accordance to the condition ("A" == "value"). By puting a series mask (a filter, basically) on your DataFrame returns a DataFrame containining only the values on the rows where you've had True in your Series mask. So, in your first code ( df[df["A"]=="value"]["B"] ), you are applying the specific mask on the DataFrame, obtaining only the rows where the column "A" was equal to "value", then you are extracting the "B" column from your DataFrame. In your second code, you are first selecting the column "B", then you are selecting only the rows where the column "A" == "value" in the initial DataFrame. Hope this helps!

CodePudding user response:

The first form select rows then column than the second one select column then rows.

You can select rows and column simultaneously with loc:

df.loc[df['A'] == 'value', 'B']

Please read Evaluation order matters

CodePudding user response:

Imagine you have two sheets of standard-sized printer paper in portrait mode. You want to get a very particular rectangle whose top-left corner is in the exact center of a sheet of paper, with width 1cm and height 3cm.

First way:

Step #1: For the first sheet, you make two vertical cuts with a pair of scissors, one exactly in the middle and one a centimeter to the right of the middle. You discard the left and right pieces, keeping only the 1cm wide strip. This is conceptually similar to performing df["B"] on a dataframe to select only the Series that is column B.

Step #2: You then make two horizontal cuts with the pair of scissors, one exactly in the middle and one three centimeters below the first cut. You discard the top and bottom pieces, keeping only the 3cm high (and 1cm wide) rectangle. This is conceptually similar to starting with the Series df["B"] (lets call this series X) and then performing X[df["A"]=="value"] to obtain the rows within X that satisfy the logical condition df["A"]=="value".

Second way:

Step #1: For the second sheet, you make two horizontal cuts with a pair of scissors, one exactly in the middle and one three centimeters below the middle. You discard the top and bottom pieces, keeping only the 3cm high strip. This is conceptually similar to performing df[df["A"]=="value"] on a dataframe to select only the rows that satisfy the logical condition df["A"]=="value".

Step #2: You then make two vertical cuts with the pair of scissors, one exactly in the middle and one a centimeter to the right of the first cut. You discard the left and right pieces, keeping only the 1cm wide (and 3cm high) rectangle. This is conceptually similar to starting with the DateFrame df[df["A"]=="value"] (lets call this dataframe Y) and then performing Y["B"] to obtain the column "B" within Y.

Observation: The thought experiment above shows that whether we cut the paper vertically then horizontally, or instead cut it horizontally then vertically, we will end up with an identical rectangular result from exactly the same location (horizontally and vertically) in either case.

Conclusion: The intuition required to understand the answer to your question is completely analogous to the more tangible example using paper. One slight difference is that the rows selected by df[df["A"]=="value"] may not be contiguous, so rather than being analogous to a 3cm high slice of paper, they may be analogous to multiple parallel horizontal strips (i.e., multiple groups of contiguous rows).

  • Related