How does Python data-frame sub-setting syntactically allow for boolean filtering within a df sub-sel

Time:07-22

Apologies if this sounds quite basic but I'm trying to understand the deeper mechanics of subsetting syntax:

I understand that with non-.loc subsetting, you can select columns, rows by index number, and cross-select columns and rows-by-index-number.

But by what mechanism do you subset a dataframe with a series of booleans, using non-.loc syntax? e.g.,

Working with this practice df:

[Image: the practice DataFrame test, which includes an age column]

you could write

test['age']==42

and get a series of booleans indicating where 42 appeared in the age column.

But when you write that same boolean filter as a subset of the same df

test[test['age']==42]

you get all the columns of the df, and full rows for any row that had 42 in the age column.

I'm wondering, more granularly, by what mechanism you subset a series of booleans from a df in this non-.loc context. Put differently, is this considered a row or column selection, or is it an entirely different mechanism that simply allows inputting a list/series/df of booleans?

It seems like you're selecting whether to show each row depending on its True/False value. And indeed, you could write

test[[True, False, True, False, False]]

to get the same result. But you'd get an error making the same direct row selection as a list, as via

test[[0,1,2,3,4]]

At bottom I'm trying to get a better understanding of the mechanism for such boolean-filtering, and how it might relate to non-.loc row/column selection.

CodePudding user response:

Your question asks:

by what mechanism do you subset a series of booleans from a dataframe, using non-.loc syntax?

The pandas docs on Boolean indexing state:

You may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame)
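As a minimal sketch of that rule: the frame below is a hypothetical stand-in for the practice df (the name column and all values are assumptions, since the original image is not available). It shows that [] with a boolean vector filters rows, and that the .loc form gives the same result.

```python
import pandas as pd

# Hypothetical stand-in for the practice frame; column "name" and all
# values are assumptions made for illustration.
test = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cat", "Dan", "Eve"],
        "age": [42, 31, 42, 25, 38],
    }
)

# Comparing a column to a scalar yields a boolean Series with the same index.
mask = test["age"] == 42

# [] recognises a boolean vector and keeps the rows where it is True.
by_brackets = test[mask]

# The explicit .loc spelling of the same row filter.
by_loc = test.loc[mask]

# A plain list of booleans of the same length behaves the same way.
by_list = test[[True, False, True, False, False]]

print(by_brackets)
```

So in this context [] performs a row selection, triggered by the boolean content of the key rather than by its labels.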

You also write:

But you'd get an error making the same direct row selection as a list, as via test[[0,1,2,3,4]]

That error occurs because [] with a list of anything other than booleans is interpreted as a list of column labels, not row index labels. Unless the DataFrame has columns named 0 through 4, the lookup fails with a KeyError. This is made explicit in the Basics subsection of the Indexing and selecting data section of the pandas docs:

You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised.
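The contrast can be sketched as follows, again using a hypothetical stand-in frame (the name column and the values are assumptions). A list of labels selects columns, a list of integers is looked up as column labels and fails, and .iloc is one way to select rows by position instead.

```python
import pandas as pd

# Hypothetical stand-in for the practice frame; values are assumptions.
test = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cat", "Dan", "Eve"],
        "age": [42, 31, 42, 25, 38],
    }
)

# A list of labels selects columns, in the order given.
cols = test[["age", "name"]]

# A list of integers is also treated as column labels. This frame has no
# columns named 0..4, so pandas raises KeyError.
try:
    test[[0, 1, 2, 3, 4]]
    error_name = None
except KeyError as exc:
    error_name = type(exc).__name__

# Selecting rows by integer position is done with .iloc instead.
rows = test.iloc[[0, 1, 2, 3, 4]]

print(error_name)
```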
