Home > Software engineering >  Pandas filter rows based on certain number of certain columns being NaN
Pandas filter rows based on certain number of certain columns being NaN

Time:05-05

I have a data set like this:

seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
A   NaN       4       5       NaN     4       7       NaN       6       8
B   7         2       9       2       1       9       2         1       1 
C   NaN       4       4       2       4       NaN     2         6       8
D   5         NaN     NaN     2       5       9       NaN       1       1 

I want to remove the rows where at least three of the columns marked 'T01' are NaN

So the output would be:

seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
B   7         2       9       2       1       9       2         1       1 
C   NaN       4       4       2       4       NaN     2         6       8
D   5         NaN     NaN     2       5       9       NaN       1       1 

Because the A row there is NaN is S01-T01, S02-T02, S03-T01. Row D also has three NaNs, but it is kept in because I am only interested in removing the rows if specifically there is >=3 NaN in the column names that have a T01 in them.

I know this could be simple to do, I wrote:

import sys
import pandas as pd

df = pd.read_csv('data.csv',sep=',')
print(df.columns.str.contains['T01'])

To first get all of the cells with T01 in them, and then I was going to count them.

I got the error:

    print(df.columns.str.contains['T01'])
TypeError: 'method' object is not subscriptable

Then I thought about iterating through the rows and counting instead e.g.:

for index,row in df.iterrows():
        if 'T01' in row:
                print(row)

This runs without error but prints nothing to screen. Could someone demonstrate a better way to do this?

CodePudding user response:

If you select only the 'T01' columns, you can take the rowwise sum of nulls and keep only rows that are less than 3.

df.loc[df[[x for x in df if 'T01' in x]].isnull().sum(1).lt(3)]
  • Related