How to find which rows have more than a certain number of columns with values (e.g. 20 columns) in pandas?

Time:07-01

I have a DataFrame of shape [100, 30], and I would like to find which rows have values in more than 20 columns. All rows have 30 columns, but some contain NaN values, so I set a threshold of 20 columns and want to find which rows still have values in the columns from position 20 onward (i.e. after df.iloc[:, 20]).

For example, even though row 05 has the most NaN values, because of the distribution I would like to find which rows have values in more than 3 columns. In the table below I want to find the indices of rows 1, 3, 4 and 7, because they have a value in at least one column after column index 3. My question is how to find them in a large data set.

ID  Col.01  Col.02  Col.03  Col.04  Col.05
01  X       X       X       X       X
02  X       X       X       NaN     NaN
03  X       X       X       X       X
04  X       X       X       X       NaN
05  X       NaN     NaN     NaN     NaN
06  X       X       X       NaN     NaN
07  X       X       X       X       NaN

My expected result:

I find that the rows with IDs 1, 3, 4 and 7 are the ones that have values beyond column 3, so I can then drop them:

ID  Col.01  Col.02  Col.03  Col.04  Col.05
02  X       X       X       NaN     NaN
05  X       NaN     NaN     NaN     NaN
06  X       X       X       NaN     NaN

but in a large DataFrame it would not be that easy.

CodePudding user response:

You can use

df['col'].isna().sum()

to count the NaN values in a single column; to count them per row, use df.isna().sum(axis=1). Refer to this answer.
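Applied to the original question, a minimal sketch of the per-row count (the column names, shapes and the threshold of 20 are placeholders taken from the question, not from any real data):

```python
import numpy as np
import pandas as pd

# Toy frame: 100 rows x 30 columns, every 5th row missing its last 10 values
df = pd.DataFrame(np.random.rand(100, 30),
                  columns=[f"col{i}" for i in range(30)])
df.iloc[::5, 20:] = np.nan

# Keep rows that have values in more than 20 columns
mask = df.notna().sum(axis=1) > 20
rows_over_20 = df[mask]
```

Rows 0, 5, 10, … have exactly 20 non-NaN values, so they fail the strict `> 20` test and are excluded.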

CodePudding user response:

Try

out = df[df.iloc[:, 4:].isna().all(axis=1)]
print(out)

# Output
   ID Col.01 Col.02 Col.03 Col.04 Col.05
1  02      X      X      X    NaN    NaN
4  05      X    NaN    NaN    NaN    NaN
5  06      X      X      X    NaN    NaN

To select by column label instead of position, you can use df.loc[:, 'Col.04':] instead of df.iloc[:, 4:].
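For a runnable check, a sketch that rebuilds the question's table (using None for the NaN cells) and applies the same mask:

```python
import pandas as pd

# Reconstruct the example frame: "X" marks a value, None becomes NaN
df = pd.DataFrame({
    "ID":     ["01", "02", "03", "04", "05", "06", "07"],
    "Col.01": ["X",  "X",  "X",  "X",  "X",  "X",  "X"],
    "Col.02": ["X",  "X",  "X",  "X",  None, "X",  "X"],
    "Col.03": ["X",  "X",  "X",  "X",  None, "X",  "X"],
    "Col.04": ["X",  None, "X",  "X",  None, None, "X"],
    "Col.05": ["X",  None, "X",  None, None, None, None],
})

# Keep rows where every column after Col.03 is NaN
out = df[df.loc[:, "Col.04":].isna().all(axis=1)]
```

This selects the rows with IDs 02, 05 and 06, matching the output shown above; the complementary set (negate the mask with `~`) gives the rows to drop.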
