Home > Mobile >  In pandas, how can I filter for rows where ALL values are higher than a certain threshold? And keep
In pandas, how can I filter for rows where ALL values are higher than a certain threshold? And keep

Time:07-16

In pandas, how can I filter for rows where ALL values are higher than a certain threshold?

Say I have a table that looks as follows:

City Bird species one Bird species two Bird Species three Bird species four
A 7 11 13 16
B 11 12 13 14
C 20 21 22 23
D 8 6 4 5

Now I only want to get rows that have ALL COUNTS greater than 10. Here that would be Row B and Row C.

So my desired output is:

City Bird species one Bird species two Bird Species three Bird species four
B 11 12 13 14
C 20 21 22 23

So, even if a single values is false I want that row dropped. Take for example in the example table, Row A has only one value less than 10 but it is dropped.

I tried doing this with df.iloc[:,1:] >= 10 which creates a boolean table and if I do df[df.iloc[:,1:] >= 10] it gives me table that shows which cells are satisfying the condition but since the first column is string all of it labelled false and I lose data there and turns out the cells that are false stay in there as well.

I tried df[(df.iloc[:,2:] >= 10).any(1)] which is the same as the iloc method and does not remove the rows that have at least one false value.

How can I get my desired output? Please note I want to keep the the first column.

Edit: The table above is an example table, that is a scaled down version of my real table. My real table has 109 columns and is the first of many future tables. Supplying all column names by hand is not a valid solution at all and makes scripting unfeasible.

CodePudding user response:

You can do this to check that all values in the row are greater than 10:

import pandas as pd

df = pd.DataFrame({
    'city': ['A','B','C','D'],
    'one': [7,11,20,8],
    'two': [11,12,21,6],
    'three': [13,13,22,4],
    'four': [16,14,23,5],
})

print(df[(df[['one', 'two', 'three']]>10).all(axis=1)])

Output is:

  city  one  two  three  four
1    B   11   12     13    14
2    C   20   21     22    23

You can also modify it to filter only in those columns with integer dtype, as follow:

print(df[(
    df[[c for c, t in zip(df.columns, df.dtypes) if pd.api.types.is_integer_dtype(t)]]>10
).all(axis=1)])

with exactly the same format. However, you may need to tweak a little bit according to the dtypes in your DataFrame.

Assuming your only non-numeric column is the first, then you can simplify it as follow:

print(df[(df[df.columns[1:]]>10).all(axis=1)])

CodePudding user response:

df[(df[df.columns[1:]]>x).all(axis=1)] where x should be replaced with the values one wants to test turns out to be the easiest answer for me. This makes it possible to parse the dataframe without having to manually type out the column names. This also assumes that all of your columns other than the first one are integers. Please make note of the other answer above that tells you how to make note of dtypes if you have mixed data.

I only slightly changed Rodrigo Laguna answer above.

  • Related