I have a simple python 3.8 DataFrame with 8 columns (simply labeled 0, 1, 2, etc.) with approx. 3500 rows. I want a subset of this DataFrame where there are at least 2 columns in each row that are above 1. I would prefer not to have to check each column individually, but be able to check all columns. I know I can use the .any(1) to check all the columns, but I need there to be at least 2 columns that meet the threshold, not just one. Any help would be appreciated. Sample code below:
import pandas as pd
df = pd.DataFrame({0:[1,1,1,1,100],
1:[1,3,1,1,1],
2:[1,3,1,1,4],
3:[1,1,1,1,1],
4:[3,4,1,1,5],
5:[1,1,1,1,1]})
Easiest way I can think to sort/filter later would be to create another column at the end df[9] that houses the count:
df[9] = df.apply(lambda x: x.count() if x > 2, axis=1)
This code doesn't work, but I feel like it's close?
CodePudding user response:
The value of x
in the lambda is a Series, which can be indexed like this.
df[9] = df.apply(lambda x: x[x > 2].count(), axis=1)
CodePudding user response:
df[(df>1).sum(axis=1)>=2]
Explanation:
(df>1).sum(axis=1)
gives the number of columns in that row that is greater than 1.- then with
>=2
we filter those rows with at least 2 columns that meet the condition --which we counted as explained in the previous bullet