Home > database >  How to get number of columns in a DataFrame row that are above threshold
How to get number of columns in a DataFrame row that are above threshold

Time:11-30

I have a simple python 3.8 DataFrame with 8 columns (simply labeled 0, 1, 2, etc.) with approx. 3500 rows. I want a subset of this DataFrame where there are at least 2 columns in each row that are above 1. I would prefer not to have to check each column individually, but be able to check all columns. I know I can use the .any(1) to check all the columns, but I need there to be at least 2 columns that meet the threshold, not just one. Any help would be appreciated. Sample code below:

import pandas as pd

df = pd.DataFrame({0:[1,1,1,1,100],
                1:[1,3,1,1,1],
                2:[1,3,1,1,4],
                3:[1,1,1,1,1],
                4:[3,4,1,1,5],
                5:[1,1,1,1,1]})

Easiest way I can think to sort/filter later would be to create another column at the end df[9] that houses the count:

df[9] = df.apply(lambda x: x.count() if x > 2, axis=1)

This code doesn't work, but I feel like it's close?

CodePudding user response:

The value of x in the lambda is a Series, which can be indexed like this.

df[9] = df.apply(lambda x: x[x > 2].count(), axis=1)

CodePudding user response:

df[(df>1).sum(axis=1)>=2]

Explanation:

  • (df>1).sum(axis=1) gives the number of columns in that row that is greater than 1.
  • then with >=2 we filter those rows with at least 2 columns that meet the condition --which we counted as explained in the previous bullet
  • Related