I have a DataFrame of shape, e.g., [100, 30], and I would like to find which rows have values in more than 20 columns. All rows have 30 columns, but some of them contain NaN values, so I set a threshold of 20 columns and want to find which rows still have values in the columns beyond column index 20.
For example, in the table below I want to find the rows that have values in more than 3 columns, i.e. rows 01, 03, 04 and 07, because they still have a value after column index 3 (even though row 05 has the most NaN values). My question is how to find these rows in a large data set.
ID | Col.01 | Col.02 | Col.03 | Col.04 | Col.05 |
---|---|---|---|---|---|
01 | X | X | X | X | X |
02 | X | X | X | NaN | NaN |
03 | X | X | X | X | X |
04 | X | X | X | X | NaN |
05 | X | NaN | NaN | NaN | NaN |
06 | X | X | X | NaN | NaN |
07 | X | X | X | X | NaN |
My expected result:
I find that the rows with ID 01, 03, 04 and 07 have values beyond column index 3, so I can delete them afterwards, leaving:
ID | Col.01 | Col.02 | Col.03 | Col.04 | Col.05 |
---|---|---|---|---|---|
02 | X | X | X | NaN | NaN |
05 | X | NaN | NaN | NaN | NaN |
06 | X | X | X | NaN | NaN |
But in a large DataFrame it would not be that easy to do by hand.
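For reference, a sketch that rebuilds the first table above ('X' here is just a placeholder string; the missing cells are np.nan):
import numpy as np
import pandas as pd

# Rebuild the example table; NaN marks the empty cells
df = pd.DataFrame({
    'ID': ['01', '02', '03', '04', '05', '06', '07'],
    'Col.01': ['X', 'X', 'X', 'X', 'X', 'X', 'X'],
    'Col.02': ['X', 'X', 'X', 'X', np.nan, 'X', 'X'],
    'Col.03': ['X', 'X', 'X', 'X', np.nan, 'X', 'X'],
    'Col.04': ['X', np.nan, 'X', 'X', np.nan, np.nan, 'X'],
    'Col.05': ['X', np.nan, 'X', np.nan, np.nan, np.nan, np.nan],
})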
CodePudding user response:
You can use
df['Col.01'].isna().sum()
to count the NaN values in a column (bracket indexing is needed here, since column names like Col.01 contain a dot and cannot be accessed as attributes). Refer to this answer.
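The same idea also works row-wise: count the non-NaN cells per row and compare against the threshold (a sketch, assuming the example DataFrame from the question with ID as the first column):
# Non-NaN cells per row, excluding the ID column
filled = df.iloc[:, 1:].notna().sum(axis=1)

# Rows with values in more than 3 columns
print(df.loc[filled > 3, 'ID'].tolist())
# ['01', '03', '04', '07']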
CodePudding user response:
Try
out = df[df.iloc[:, 4:].isna().all(axis=1)]
print(out)
# Output
ID Col.01 Col.02 Col.03 Col.04 Col.05
1 02 X X X NaN NaN
4 05 X NaN NaN NaN NaN
5 06 X X X NaN NaN
If your columns have names, you can use df.loc[:, 'Col.04':] instead of df.iloc[:, 4:].
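Since the question ultimately wants to delete those rows, the inverse of the same mask picks out the rows that still have a value past the cutoff (a sketch, assuming the same column layout where Col.04 sits at position 4):
# Rows with at least one value after Col.03 -- the ones the question wants to delete
to_drop = df.iloc[:, 4:].notna().any(axis=1)
print(df.loc[to_drop, 'ID'].tolist())
# ['01', '03', '04', '07']

df = df[~to_drop]  # keeps only rows 02, 05, 06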