I have the following problem, from a 5 column dataframe I need to assign a label to it but only if any of the columns has more than 2 non nan values, for example for the following dataframe:
Name Orange Blue Red Black
A nan nan ok nan
A nan nan nan nan
A nan ok ok nan
A ok nan ok ok
In this case A would have to be label as Red, I tried with the elimination of the columns with only nan values:
df[df[~df.isnull().all()]]
but i can't assign the label properly, any idea to optimize the process? the output can be a df with the column with more non nan vales like:
Name Red
A ok
A nan
A ok
A ok
CodePudding user response:
Here is one way to do it by using notna
and boolean indexing :
N = 2 # <- adjust here the threshold
sl = df.set_index("Name").notna().sum().loc[lambda s: s.gt(N)]
out = df[["Name"]].join(df.iloc[:, sl])
Another variant (better one) :
out = df.loc[:, df.notna().sum() >= N]
Output :
print(out)
Name Red
0 A ok
1 A NaN
2 A ok
3 A ok